CAN: A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE MASKED AUTOENCODER FRAMEWORK FOR LEARNING VISUAL REPRESENTATIONS

Abstract

We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations within a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. For instance, when pre-training ViT-B models on the curated ImageNet dataset, CAN achieves 74.8% top-1 linear probing accuracy, an absolute improvement of 6.8% over MAE and 1.3% over SimCLR with the same architecture and data augmentations. CAN is especially useful for pre-training on larger uncurated datasets such as JFT-300M: under ImageNet linear probe evaluation, CAN achieves 75.4%, compared to 73.4% for SimCLR and 64.1% for MAE. Finetuning our ViT-L model on ImageNet attains 86.1% top-1 accuracy, compared to 85.5% for SimCLR and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than that of CAN for ViT-L models.¹

1. INTRODUCTION

Self-supervised learning promises continued advances in the state of the art by enabling the use of increasingly large models and datasets. However, interest in larger datasets has precipitated an increased reliance on web-scraped data collection processes, which result in heterogeneous and "uncurated" datasets (Yu et al., 2022; Radford et al., 2021; Jia et al., 2021) . Extreme image heterogeneity has made scaling vision models to uncurated datasets a non-trivial challenge (Tian et al., 2021; Cole et al., 2022) . There are two families of self-supervised methods for images which have both proven highly effective on curated datasets (e.g., ImageNet), and are therefore natural candidates for scaling to large, uncurated data. First, masked image models such as the masked autoencoder (MAE) (He et al., 2022) are a nascent set of methods based on a mask-and-reconstruct training mechanism. This classical idea (Ballard, 1987) is enjoying a rejuvenation thanks to favourable efficiency when combined with the vision transformer architecture (Dosovitskiy et al., 2021b) . Second, contrastive learning (van den Oord et al., 2018; Chen et al., 2020b; He et al., 2020) trains an encoder to distinguish between pairs of positive samples generated with data augmentations and negative pairs sampled at random. Both approaches have proven to be very powerful self-supervised methods.

Contrastive learning and masked autoencoders (MAE) employ very different learning mechanisms: the former trains the encoder to be invariant to semantics-preserving data variations, while MAEs learn spatial statistical correlations. Furthermore, MAE methods treat each sample independently in the loss function, while contrastive methods explicitly relate all samples in the batch, reducing or increasing embedding distances between them. Given this, we hypothesize that the two approaches are complementary, extracting different discriminative features for a given input. If this hypothesis holds, we expect improved performance on various downstream tasks that build on the extracted features. This motivates our exploration of a combined method. Further, inspired by advances in diffusion models (Ho et al., 2020; Song et al., 2021), we introduce a third loss based on noise prediction during the masked autoencoder reconstruction: we add Gaussian noise to the unmasked input patches and train the model to predict the noise added to each patch. Denoising encourages the encoder to extract higher-frequency information from the input, whereas autoencoder reconstructions tend to focus on low-frequency information (Hou et al., 2017). This additional loss serves two purposes: it improves downstream performance, and it addresses a source of wasted computation in MAE, with negligible impact on FLOPs, namely that reconstructions of unmasked patches are thrown away unused. Combining these ingredients, we present CAN, a minimal fusion of contrastive learning, masked autoencoders, and the denoising diffusion training loss. Our method enjoys stronger performance than its constituent parts do on their own, with especially pronounced benefits on more uncurated datasets such as JFT-300M, which contains 300 million highly heterogeneous images that often contain artifacts (e.g., watermarks).

¹ Code will be released soon.
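The three-part objective described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the loss weights, temperature, and the shapes of the (hypothetical) predictor outputs are illustrative assumptions. The contrastive term is a standard InfoNCE loss over pooled view embeddings; the reconstruction term applies only to masked patches, and the denoising term only to the unmasked (noised) patches.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE) loss between two batches of view embeddings.

    Positives are matching rows of z1/z2; all other pairs act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (B, B) similarity matrix
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # positives lie on the diagonal

def can_loss(patches, mask, pred_pixels, pred_noise, noise, z1, z2,
             w_recon=1.0, w_denoise=1.0):
    """Combine the three CAN losses: contrastive + reconstruction + denoising.

    patches:     (B, N, D) ground-truth patch pixels
    mask:        (B, N) boolean, True where a patch was masked out
    pred_pixels: (B, N, D) decoder's pixel predictions
    pred_noise:  (B, N, D) decoder's noise predictions
    noise:       (B, N, D) the Gaussian noise that was added to unmasked patches
    z1, z2:      (B, E) pooled embeddings of the two views
    """
    l_contrast = info_nce(z1, z2)
    l_recon = np.mean((pred_pixels[mask] - patches[mask]) ** 2)      # masked only
    l_denoise = np.mean((pred_noise[~mask] - noise[~mask]) ** 2)     # unmasked only
    return l_contrast + w_recon * l_recon + w_denoise * l_denoise
```

Note that, as in the text, the denoising term recycles decoder outputs at unmasked positions that MAE would discard, so it adds essentially no FLOPs.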
For instance, evaluating JFT-trained ViT-L models using the top-1 accuracy of an ImageNet-trained linear probe, MAE achieves 64.1% and SimCLR achieves 73.4%, while CAN achieves 75.4%. CAN masks 50% of the patches in each view, making it significantly more scalable than prior contrastive methods, which use two full image views. Our contributions are:

1. We present CAN, a simple self-supervised learning algorithm with good scaling properties, making it suitable for training on very large image datasets, such as the JFT-300M dataset.

2. CAN is much more efficient than SimCLR (Figure 1). For instance, SimCLR uses 70% more FLOPs than CAN with ViT-L models.

3. CAN is more robust to distribution shifts than MAE or SimCLR, and performs better on a wide range of few-shot and linear transfer tasks.
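The symmetric 50% masking that underlies the efficiency claim above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: for brevity, the two views are two independent random maskings of the same image, whereas the actual method would first apply data augmentations to produce each view. Only the visible patches are kept, so the encoder processes half the tokens per view, which is the source of the FLOPs saving over full-view contrastive methods.

```python
import numpy as np

def patchify(images, patch_size=16):
    """Split a batch of (B, H, W, C) images into flattened non-overlapping patches."""
    B, H, W, C = images.shape
    gh, gw = H // patch_size, W // patch_size
    p = images.reshape(B, gh, patch_size, gw, patch_size, C)
    return p.transpose(0, 1, 3, 2, 4, 5).reshape(B, gh * gw, -1)  # (B, N, D)

def two_masked_views(images, patch_size=16, mask_ratio=0.5, seed=0):
    """Return the visible patches of two independently 50%-masked views."""
    rng = np.random.default_rng(seed)
    patches = patchify(images, patch_size)
    B, n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    views = []
    for _ in range(2):  # masking is symmetric: both views are masked at random
        # per-image random permutation of patch indices; keep the first n_keep
        ids = np.argsort(rng.random((B, n)), axis=1)[:, :n_keep]
        views.append(np.take_along_axis(patches, ids[:, :, None], axis=1))
    return views
```

For a 64x64 image with 16x16 patches, each view keeps 8 of 16 patches, so the encoder sees half the tokens it would for a full view.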



Figure 1: Left: CAN scales better than SimCLR since it uses masked inputs. Middle and right: CAN outperforms SimCLR and MAE on ImageNet linear probe and finetune evaluations for ViT-L models when pre-training on uncurated data such as JFT-300M.

Masked image models with Vision Transformers. The advent of the Vision Transformer (ViT) (Dosovitskiy et al., 2021b) provoked a focused effort to develop strong self-supervised learning frameworks that use ViT backbones. Works such as DINO (Caron et al., 2021) and MoCo-v3 (Chen et al., 2021b) demonstrated that techniques developed with ConvNet backbones in mind could also perform competitively using ViTs after proper tuning to suit the new architecture. ViT-specific methods have emerged since then, particularly masked image modelling (Bao et al., 2022; Chen et al., 2022; Xie et al., 2022), which takes inspiration from pre-training methods used in NLP (Devlin et al., 2018). Notably, MAE (He et al., 2022) showed that classical masked autoencoding approaches can be used to pre-train ViTs without passing masked tokens through the encoder. This provides a significant efficiency boost; our method similarly takes advantage of this.


Under review as a conference paper at ICLR 2023 

