ENERGY-INSPIRED SELF-SUPERVISED PRETRAINING FOR VISION MODELS

Abstract

Motivated by the fact that the forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised vision model pretraining framework inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network, without any auxiliary components, e.g., an extra decoder. For the forward pass, we fit the network to an energy function that assigns low energy scores to samples that belong to an unlabeled dataset, and high energy otherwise. For the backward pass, we restore data from corrupted versions iteratively using gradient-based optimization along the direction of energy minimization. In this way, we naturally fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. As a result, our framework accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained with masked image modeling, patch sorting, and image restoration tasks, including super-resolution, denoising, and colorization. We support our findings with extensive experiments, and show that the proposed method delivers comparable or even better performance with remarkably fewer training epochs compared to state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploration of self-supervised vision model pretraining and pretext tasks beyond masked image modeling.
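The restoration step described above, iterative gradient descent on the energy with respect to the input, can be illustrated with a toy sketch. The quadratic energy, the names (`energy`, `restore`), and all hyperparameters below are illustrative assumptions, not the paper's actual network or settings; a real implementation would compute the energy with the network's forward pass and obtain the input gradient via backpropagation.

```python
import numpy as np

# Toy stand-in for the learned energy function: E(x) = 0.5 * ||x - mu||^2
# assigns low energy near mu (playing the role of the "data manifold").
# In the actual framework, E would be the output of the pretrained network.
MU = np.array([1.0, -2.0, 0.5])

def energy(x):
    return 0.5 * np.sum((x - MU) ** 2)

def energy_grad(x):
    # Analytic gradient of the toy energy; a deep network would obtain
    # dE/dx via its backward pass instead.
    return x - MU

def restore(x_corrupted, steps=100, lr=0.1):
    """Iteratively restore a corrupted sample by moving it along the
    direction of energy minimization (plain gradient descent)."""
    x = x_corrupted.copy()
    for _ in range(steps):
        x = x - lr * energy_grad(x)
    return x

# Simulated corruption: the clean sample MU shifted away from the low-energy region.
corrupted = MU + np.array([3.0, -1.5, 2.0])
restored = restore(corrupted)
```

After restoration, the sample's energy should be far lower than that of its corrupted version, mirroring how the framework recovers masked or degraded image content without a separate decoder.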

1. INTRODUCTION

The recent rapid development of computation hardware and deep network architectures has paved the way for learning very large deep networks that match or even exceed human performance on complex tasks (Brown et al., 2020; He et al., 2017; Silver et al., 2016). However, as annotating data remains costly, leveraging unlabeled data to facilitate the learning of very large models has attracted increasing attention. The success of exploiting context information in massive unlabeled data in natural language processing (NLP) stimulated Chen et al. (2020a) to use the direct modeling of pixel sequences as the pretext task for vision model pretraining. Recent self-supervised vision model pretraining methods based on masked image modeling (MIM) (He et al., 2022; Wei et al., 2021; Xie et al., 2022) typically adopt an autoencoder (AE) architecture, where the target vision model to be pretrained serves as an encoder that maps an image with incomplete pixel information to a latent representation, and an auxiliary decoder is jointly trained to restore the missing information from that representation. Contrastive self-supervised learning methods (Chen et al., 2020b) usually require large training batch sizes to provide sufficient negative samples. Recent Siamese-network-based self-supervised learning methods (Grill et al., 2020; Chen & He, 2021; Tian et al., 2021; He et al., 2020; Chen et al., 2021) alleviate the large-batch challenge by deploying a momentum copy of the target model to facilitate training and prevent trivial solutions. VICReg (Bardes et al., 2022) prevents feature collapse with two explicit regularization terms, and Barlow Twins (Zbontar et al., 2021) reduces the need for large batch sizes or Siamese networks by proposing a new objective based on the cross-correlation matrix between features of different image augmentations.

In this paper, we make a further step towards the following question: Can we train a standard deep network to perform both representation encoding and masked prediction simultaneously, so that no auxiliary components, heavy data augmentations, or modifications to the network structure are required?

