CAN: A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE MASKED AUTOENCODER FRAMEWORK FOR LEARNING VISUAL REPRESENTATIONS

Abstract

We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. For instance, when pre-training ViT-B models on the curated ImageNet dataset, CAN achieves 74.8% top-1 linear probing accuracy, an absolute improvement of 6.8% over MAE and 1.3% over SimCLR with the same architecture and data augmentations. CAN is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probing on ImageNet, CAN achieves 75.4%, compared to 73.4% for SimCLR and 64.1% for MAE. Finetuning our ViT-L model on ImageNet attains 86.1% top-1 accuracy, compared to 85.5% for SimCLR and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than that of CAN for ViT-L models.
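The symmetric masking described above is the source of CAN's efficiency: with 50% of patches dropped in each view, the encoder processes only half of the tokens per view. A minimal NumPy sketch of this step is given below; the function name, shapes, and the idea of returning kept indices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def random_mask(patches, mask_ratio=0.5, rng=None):
    """Randomly drop `mask_ratio` of the patches in each image.

    `patches` has shape (batch, num_patches, dim). Returning only the
    kept subset means the encoder processes half the tokens per view,
    roughly halving its FLOPs at a 50% mask ratio.
    """
    rng = rng or np.random.default_rng()
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    # Draw an independent random subset of patch indices per image.
    keep = np.argsort(rng.random((b, n)), axis=1)[:, :n_keep]
    kept = np.take_along_axis(patches, keep[..., None], axis=1)
    return kept, keep
```

The kept indices would later be needed to scatter decoded patches back to their original positions for the reconstruction and noise prediction losses.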

1. INTRODUCTION

Self-supervised learning promises continued advances in the state of the art by enabling the use of increasingly large models and datasets. However, interest in larger datasets has precipitated an increased reliance on web-scraped data collection processes, which result in heterogeneous and "uncurated" datasets (Yu et al., 2022; Radford et al., 2021; Jia et al., 2021). Extreme image heterogeneity has made scaling vision models to uncurated datasets a non-trivial challenge (Tian et al., 2021; Cole et al., 2022).

There are two families of self-supervised methods for images that have both proven highly effective on curated datasets (e.g., ImageNet), and are therefore natural candidates for scaling to large, uncurated data. First, masked image models such as the masked autoencoder (MAE) (He et al., 2022) are a nascent set of methods based on a mask-and-reconstruct training mechanism. This classical idea (Ballard, 1987) is enjoying a rejuvenation thanks to favourable efficiency when combined with the vision transformer architecture (Dosovitskiy et al., 2021b). Second, contrastive learning (van den Oord et al., 2018; Chen et al., 2020b; He et al., 2020) trains an encoder to distinguish between pairs of positive samples generated with data augmentations and negative pairs sampled at random. Both approaches have proven to be very powerful self-supervised methods.

Contrastive learning and masked autoencoders (MAE) employ very different learning mechanisms: the former trains the encoder to be invariant to semantics-preserving data variations, while MAEs learn spatial statistical correlations. Furthermore, MAE methods treat each sample independently.

1 Code will be released soon.
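The contrastive mechanism described above is commonly instantiated as an InfoNCE-style loss, in which matched augmented views form positive pairs and the other samples in the batch serve as negatives. The following NumPy sketch illustrates this standard loss; it is a generic illustration of contrastive learning, not CAN's specific objective, and the temperature value is an arbitrary placeholder.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss: row i of z1 and row i of z2 are two
    augmented views of the same image (a positive pair); all other rows
    in the batch act as negatives."""
    # L2-normalize embeddings so dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (matching pairs) as targets.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pulls embeddings of positive pairs together while pushing apart embeddings of different images, which is the batch-level shaping of the embedding space referred to above.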

