CAN: A SIMPLE, EFFICIENT AND SCALABLE CONTRASTIVE MASKED AUTOENCODER FRAMEWORK FOR LEARNING VISUAL REPRESENTATIONS

Abstract

We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. For instance, when pre-training ViT-B models on the curated ImageNet dataset, CAN achieves 74.8% top-1 linear probing accuracy, an absolute improvement of 6.8% over MAE and 1.3% over SimCLR with the same architecture and data augmentations. CAN is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. Finetuning our ViT-L model on ImageNet attains 86.1% top-1, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than that of CAN for ViT-L models.¹

1. INTRODUCTION

Self-supervised learning promises continued advances in the state of the art by enabling the use of increasingly large models and datasets. However, interest in larger datasets has precipitated an increased reliance on web-scraped data collection processes, which result in heterogeneous and "uncurated" datasets (Yu et al., 2022; Radford et al., 2021; Jia et al., 2021). Extreme image heterogeneity has made scaling vision models to uncurated datasets a non-trivial challenge (Tian et al., 2021; Cole et al., 2022).

There are two families of self-supervised methods for images which have both proven highly effective on curated datasets (e.g., ImageNet), and are therefore natural candidates for scaling to large, uncurated data. First, masked image models such as the masked autoencoder (MAE) (He et al., 2022) are a nascent set of methods based on a mask-and-reconstruct training mechanism. This classical idea (Ballard, 1987) is enjoying a rejuvenation thanks to favourable efficiency when combined with the vision transformer architecture (Dosovitskiy et al., 2021b). Second, contrastive learning (van den Oord et al., 2018; Chen et al., 2020b; He et al., 2020) trains an encoder to distinguish between pairs of positive samples generated with data augmentations and negative pairs sampled at random. Both approaches have proven to be very powerful self-supervised methods.

Contrastive learning and masked autoencoders (MAE) employ very different learning mechanisms: the former trains the encoder to be invariant to semantics-preserving data variations, while MAEs learn spatial statistical correlations. Furthermore, MAE methods treat each sample independently in the loss function, while contrastive methods explicitly look at the relationship between all samples in the batch, by either reducing or increasing embedding distance. Given this, we hypothesize that these two approaches are complementary, extracting different discriminative features for a given input. If this hypothesis holds, then we expect to see improved performance on various downstream tasks based on the extracted features. This motivates our exploration of a combined method.

Further, inspired by advances in diffusion models (Ho et al., 2020; Song et al., 2021), we introduce a third loss based on noise prediction during the masked autoencoder reconstruction. We add Gaussian noise to unmasked input patches, and train the model to predict the noise added to each patch. Denoising encourages the encoder to extract higher-frequency information from the input, while autoencoder reconstructions tend to focus on low-frequency information (Hou et al., 2017). This additional loss serves two purposes: it improves downstream performance, and, with negligible impact on FLOPs, it addresses a source of wasted computation in MAE, namely that reconstructions of unmasked patches are thrown away unused.

Combining these ingredients we present CAN, a minimal fusion of contrastive learning, masked autoencoders and denoising diffusion training loss. Our method enjoys stronger performance than its constituent parts do on their own, with especially pronounced benefits on more uncurated datasets such as JFT-300M, which contains 300 million highly heterogeneous images, often containing artifacts (e.g., watermarks). For instance, evaluating JFT-trained ViT-L models using the top-1 accuracy of an ImageNet-trained linear probe, MAE achieves 64.1% and SimCLR achieves 73.4%, while CAN achieves 75.4%. CAN masks 50% of patches in each view, making it significantly more scalable than prior contrastive methods that use two full image views.

Our contributions are:

1. We present CAN, a simple self-supervised learning algorithm with good scaling properties, making it suitable for training on very large image datasets, such as the JFT-300M dataset.
2. CAN is much more efficient than SimCLR (Figure 1). For instance, SimCLR uses 70% more FLOPs than CAN with ViT-L models.
3. CAN is more robust to distribution shifts than MAE or SimCLR, and performs better on a wide range of few-shot and linear transfer tasks.

2. RELATED WORK

Masked image models with Vision Transformers. The advent of the Vision Transformer (ViT) (Dosovitskiy et al., 2021b) provoked a focused effort to develop strong self-supervised learning frameworks that use ViT backbones. Works such as DINO (Caron et al., 2021) and MoCo-v3 (Chen et al., 2021b) demonstrated that techniques developed with ConvNet backbones in mind could also perform competitively using ViTs after proper tuning to suit the new architecture. ViT-specific methods have emerged since then, particularly masked image modelling (Bao et al., 2022; Chen et al., 2022; Xie et al., 2022), which takes inspiration from pre-training methods used in NLP (Devlin et al., 2018). Notably MAE (He et al., 2022) showed that classical masked autoencoding approaches could be used to pre-train ViTs without passing masked tokens through the encoder. This provides a significant efficiency boost; our method similarly takes advantage of this.

Contrastive learning in computer vision. Self-supervision has received significant attention in computer vision as it offers a way to extract general purpose features without supervision. In particular, contrastive learning (van den Oord et al., 2018; Hénaff et al., 2020; Chen et al., 2020b; He et al., 2020; Tian et al., 2020; Chuang et al., 2020; Hénaff et al., 2021) has achieved state of the art performance by enforcing invariance to augmentations, whilst using negative samples (Robinson et al., 2021a; Ge et al., 2021) to avoid trivial solutions by spreading the embedding out uniformly on the sphere (Wang & Isola, 2020). The contrastive pre-training task is conceptually very different from masked image models such as MAE, which learn spatial statistical dependencies. Another distinction is that autoencoders encourage information preservation in latent representations, whilst contrastive learning could suppress features (Chen et al., 2021a; Robinson et al., 2021b). This leads us to hypothesize that the two approaches learn different data features, and may therefore be complementary learning mechanisms. This motivates us to combine contrastive learning and masked image modelling so as to develop a reinforced pre-training task that enjoys the merits of each.

Denoising diffusion models. Denoising autoencoders (DAE) (Vincent et al., 2010) learn to reconstruct clean data given a noisy input. By learning to map low-density data regions to high-density regions, DAE learns the shape of the data manifold. This connection was made precise by Vincent (2011), who showed that DAEs learn the score function $s(x) = \nabla_x \log p(x)$. This key observation underpins the significant recent advances in generative diffusion models, which use an estimate of the score function to generate samples (Ho et al., 2020; Song et al., 2021). The recent success of DAEs in generative modelling has not yet translated to representation learning, with some exceptions (Asiedu et al., 2022; Zaidi et al., 2022). In this work we exploit a denoising autoencoder to eliminate the MAE inefficiency of reconstructing unmasked patches but never using them.

Concurrent work. Several recent works propose approaches that combine ideas from masked image modelling and Siamese self-supervised learning. For instance, Huang et al. (2022) propose a combination of contrastive and masked reconstruction objectives using one masked view, and one full (unmasked) view. Other recent works (Tao et al., 2022; Chen et al., 2022; Assran et al., 2022) use similar asymmetric designs. The key distinction between CAN and concurrent work is that we strike a different balance between simplicity, efficiency, and performance. We focus on developing a simple, efficient and symmetric method: we use two masked views and no momentum encoder. We hope the simplicity and efficiency of CAN will make it easy to adapt and modify in future work.

3. A SIMPLE CONTRASTIVE MASKED AUTOENCODER FRAMEWORK

Our approach is a minimal synthesis of contrastive learning, the masked autoencoder (He et al., 2022) , and the denoising loss used in the training of diffusion models. We focus on simplicity and scalability, aiming to design a hybrid with as few complex or costly components as possible. We also aim to minimize wasted computation: in particular, the MAE decoder requires reconstructions of all patches, but only those of masked patches are used in the loss. Below, first we detail the basic pipeline of generating views and passing masked inputs through the encoder and decoder. Then we explain the three different objectives we use: contrastive, reconstruction, and denoising. The penultimate section describes the combined objective, and the final section discusses scalability.

3.1. OVERVIEW OF METHOD

Given a batch of n images $\{x_i\}_{i=1}^{n}$, we generate two views $x_i^1, x_i^2 \in \mathbb{R}^{h \times w \times 3}$ of each image without supervision using the same data augmentations as Chen et al. (2020b). Each image is then split into $T = (h/p) \times (w/p)$ patches of size $p \times p$: $x_{i,\text{patch}}^1, x_{i,\text{patch}}^2 \in \mathbb{R}^{T \times p \times p \times 3}$, in preparation for input to the ViT encoder. We always assume that p divides h and w. Two masks $M_i^1, M_i^2 \in \{0, 1\}^T$ are independently generated, with a 1 in coordinate $t \in \{1, \dots, T\}$ indicating that the t-th patch is masked. Each patch is masked independently with probability m, conditioned on always having exactly $T' = (1 - m) \cdot T$ patches unmasked, which we assume is an integer. In all CAN experiments our default masking rate is m = 50% unless explicitly stated otherwise (for MAE we use the default 75%). Following He et al. (2022), only the $T'$ unmasked patches are passed to the ViT encoder, which processes the two views in parallel. Masking a large fraction of patches from both views makes our method much more efficient (see Table 1) than contrastive methods that use two full views, and recent works that use one full view and one masked view (Assran et al., 2022; Huang et al., 2022). Finally, we collect the embeddings of unmasked tokens $z_i^1, z_i^2 \in \mathbb{R}^{T' \times d}$ and reshape into $T \times d$ tensors by adding a learned [M] embedding to positions corresponding to masked tokens. The result is passed through a comparatively lightweight ViT decoder to produce outputs $\hat{x}_i^1, \hat{x}_i^2$ in image space $\mathbb{R}^{h \times w \times 3}$.

Figure 2: The CAN framework: Two views of an image are generated, 50% of patches randomly masked in each, and noise is added to patches. An encoder is trained to solve three tasks: 1) Reconstruction: encoded patches are passed to a decoder that reconstructs missing patches, 2) Denoise: reconstructs the noise added to unmasked patches, and 3) Contrast: pooled patches are passed to a contrastive loss, using in-batch samples as negatives (Chen et al., 2020b).
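The patch and mask bookkeeping described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the helper names `num_patches` and `sample_mask`, the 224 × 224 input size, and the patch size of 16 are assumptions made for the example, and the mask is drawn by choosing a fixed number of patch indices uniformly at random.

```python
import random

def num_patches(h, w, p):
    """Number of p x p patches in an h x w image (p must divide h and w)."""
    assert h % p == 0 and w % p == 0, "patch size must divide both image sides"
    return (h // p) * (w // p)

def sample_mask(T, mask_rate, rng):
    """Return a 0/1 list of length T; a 1 marks a masked patch.

    Exactly round(mask_rate * T) patches are masked, chosen uniformly at random.
    """
    n_masked = round(mask_rate * T)
    masked = set(rng.sample(range(T), n_masked))
    return [1 if t in masked else 0 for t in range(T)]

rng = random.Random(0)
T = num_patches(224, 224, 16)      # assumed image and patch sizes
mask_1 = sample_mask(T, 0.5, rng)  # mask for view 1
mask_2 = sample_mask(T, 0.5, rng)  # mask for view 2, drawn independently
# Only the unmasked patch indices would be passed to the encoder.
kept_1 = [t for t in range(T) if mask_1[t] == 0]
```

With a 50% masking rate the encoder sees half of the tokens of each view, which is where the efficiency gain over two full views comes from.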

3.2. CONTRASTIVE LEARNING OBJECTIVE

The embeddings $z_i^1, z_i^2 \in \mathbb{R}^{T' \times d}$ returned by the encoder are pooled via a simple mean along the first dimension to form d-dimensional embeddings, which are passed through a lightweight MLP projection head that maps into a lower-dimensional space $\mathbb{R}^r$, $r < d$, and normalized to unit length to produce embeddings $u_i^1, u_i^2 \in \mathbb{R}^r$ for $i = 1, \dots, n$. For the i-th batch item we collect the other $2n - 2$ in-batch samples $\mathcal{N}_i = \{u_j^1, u_j^2\}_{j \neq i}$ to use as negatives, and compute the InfoNCE loss:

$$\mathcal{L}_{\text{InfoNCE}} = \frac{1}{2n} \sum_{v=1,2} \sum_{i=1}^{n} -\log \frac{e^{u_i^{1\top} u_i^2 / \tau}}{e^{u_i^{1\top} u_i^2 / \tau} + \sum_{u^- \in \mathcal{N}_i} e^{u_i^{v\top} u^- / \tau}}$$

where $\tau > 0$ is a temperature parameter, which we set to $\tau = 0.1$ by default.
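To make the structure of the contrastive loss concrete, the following is a minimal pure-Python sketch, assuming the embeddings are already pooled, projected, and normalized. The function names `normalize`, `dot`, and `info_nce` are hypothetical, and the explicit loops are for readability only; a practical implementation would use batched matrix operations.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(u1, u2, tau=0.1):
    """InfoNCE over two views; u1 and u2 are lists of n unit-norm embeddings."""
    n = len(u1)
    views = (u1, u2)
    total = 0.0
    for v in range(2):
        for i in range(n):
            # Positive pair: the two views of the same image.
            pos = math.exp(dot(u1[i], u2[i]) / tau)
            # Negatives: both views of every other batch item (2n - 2 samples).
            neg = sum(
                math.exp(dot(views[v][i], views[w][j]) / tau)
                for w in range(2) for j in range(n) if j != i
            )
            total += -math.log(pos / (pos + neg))
    return total / (2 * n)

# Toy batch of two orthogonal embeddings.
u = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
loss_aligned = info_nce(u, u)        # each image matches its own second view
loss_swapped = info_nce(u, u[::-1])  # second views swapped between images
```

In the toy example the loss is close to zero when the two views of each image agree and large when the second views are swapped, which is the behaviour the objective is meant to reward.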

3.3. PATCH RECONSTRUCTION OBJECTIVE

The outputs $\hat{x}_i^1, \hat{x}_i^2$, $i = 1, \dots, n$, of the ViT decoder are trained to reconstruct the missing patches of each image. Corroborating the findings of He et al. (2022), we find it best to only compute the reconstruction loss on masked patches:

$$\mathcal{L}_{\text{rec}} = \frac{1}{2n} \sum_{v=1,2} \sum_{i=1}^{n} \left\| M_i^v \circ (x_i^v - \hat{x}_i^v) \right\|_2^2$$

where $\circ$ multiplies all pixels in the t-th patch of the residual image $x_i^v - \hat{x}_i^v$ by $(M_i^v)_t \in \{0, 1\}$. Whilst computing the loss only on masked patches gives better performance, it indicates wasted computation since the decoder also produces reconstructions for unmasked patches. To avoid waste we propose an alternative objective specifically for unmasked patches, which we discuss next.
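A small sketch of the masked reconstruction loss follows, with images represented as lists of flattened patches. The representation and the names `masked_sq_error` and `reconstruction_loss` are assumptions chosen for illustration, not part of the original method description.

```python
def masked_sq_error(x, x_hat, mask):
    """Sum of squared pixel errors over masked patches only.

    x, x_hat: lists of T patches, each a flat list of pixel values.
    mask: list of T entries in {0, 1}; a 1 marks a masked patch.
    """
    total = 0.0
    for patch, patch_hat, m in zip(x, x_hat, mask):
        if m == 1:
            total += sum((a - b) ** 2 for a, b in zip(patch, patch_hat))
    return total

def reconstruction_loss(views, recons, masks):
    """Average over the batch and both views: 1/(2n) * sum over v and i."""
    n = len(views[0])
    total = sum(
        masked_sq_error(views[v][i], recons[v][i], masks[v][i])
        for v in range(2) for i in range(n)
    )
    return total / (2 * n)

# Toy example: one image, two patches of two pixels each, a different mask per view.
x = [[1.0, 2.0], [3.0, 4.0]]
x_hat = [[0.0, 0.0], [3.0, 2.0]]
loss = reconstruction_loss(([x], [x]), ([x_hat], [x_hat]), ([[0, 1]], [[1, 0]]))
```

Errors on unmasked patches do not contribute to this loss, so the decoder outputs at those positions are free to be used for a different target.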

3.4. DENOISING OBJECTIVE

Inspired by the significant advances in diffusion modelling using denoising training objectives (Ho et al., 2020; Kingma et al., 2021) and their equivalent score-based counterparts (Song et al., 2021; Vincent, 2011), we revisit the suitability of denoising for self-supervised learning. We add independent isotropic Gaussian noise to each image, $x_i^v \leftarrow x_i^v + \sigma_i^v e_i^v$, with $e_i^v \sim \mathcal{N}(0, I)$ and $\sigma_i^v$ uniformly sampled from an interval $[0, \sigma_{\max}]$. This noisy input is masked and passed to the encoder as described in Section 3.1. When passing encoded patches to the decoder we make a small addition to the method in Section 3.1 to provide the decoder with information on the noise level $\sigma_i^v$ to help it separate noise from the ground truth image. This is motivated by denoising diffusion methods, which pass both the noisy image and the noise level as inputs to the denoising model (Ho et al., 2020). We achieve this by using $\sigma_i^v$ as a positional encoding in the decoder, similarly to Vaswani et al. (2017). First we produce a sinusoidal embedding of $\sigma_i^v$ in $\mathbb{R}^d$, which is passed through a lightweight 2-layer MLP with ReLU activations of constant width d to produce a (learnable) embedding $p_i^v \in \mathbb{R}^d$, whose dimension matches the latent dimension of $z_i^v \in \mathbb{R}^{T \times d}$. We add the result to each embedded token (including missing tokens [M]) to provide noise-level information: $(z_i^v)_t \leftarrow (z_i^v)_t + p_i^v$ for $t = 1, \dots, T$, and pass the result to the decoder, producing $\hat{x}_i^v$. We define our denoising loss function, which is computed only on unmasked pixels:

$$\mathcal{L}_{\text{denoise}} = \frac{1}{2n} \sum_{v=1,2} \sum_{i=1}^{n} \left\| (1 - M_i^v) \circ (\sigma_i^v e_i^v - \hat{x}_i^v) \right\|_2^2$$

where $\circ$ is as defined in Section 3.3. Note that this denoising loss is extremely lightweight, introducing only a very small overhead due to the MLP. We emphasize that the reconstruction of noise patches comes at zero additional cost since the decoder produces reconstructions of all patches, both masked and unmasked, even though only reconstructions of masked patches are used in $\mathcal{L}_{\text{rec}}$. Finally, it has often been observed in the diffusion modelling literature that although it is equivalent to train a denoising model to estimate the noise, or to estimate the clean input itself (Vincent, 2011), there is a big empirical gap between the two, with noise prediction faring better. While we do not pursue it further, our testing corroborates this.

| None | +noise | +noise, +loss | Full |
|------|--------|---------------|------|
| 67.9 | 68.6   | 68.4          | 68.9 |

Table 1: Ablating components of the denoising objective. "Full" denotes the entire method as described in Section 3.4.

Ablation: Table 1 studies the effect of each of the components of the denoising method. We use ViT-B models trained for 100 epochs on ImageNet, and consider four settings, each adding in more parts of the method: 1) CAN with no denoising, 2) adding noise to the input only, 3) adding noise and using the denoising loss, and 4) the full method with all of the described components, including using $\sigma_i^v$ as a positional encoding in the decoder. Results show that simply adding noise as a data augmentation improves performance by 0.7%, which can be improved to 1% by adding a reconstruction loss with the noise level passed as an argument. The noise level argument is necessary: the reconstruction loss without the noise level argument performs worse (68.4%) than noise with no reconstruction at all (68.6%). We emphasize that the improvement from denoising comes at minimal cost to run time and memory during training, since it uses reconstructions produced by the decoder, which in the case of MAE are simply thrown away unused. Denoising prediction encourages the encoder to extract high-frequency features, which we hypothesize is complementary to reconstruction and contrastive tasks.
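The pieces of the denoising objective (noise sampling, a sinusoidal embedding of the noise level, and a loss restricted to unmasked patches) can be sketched as follows. This is a hedged illustration: the frequency schedule in `sinusoidal_embedding` follows the common Transformer convention and is an assumption, the learned 2-layer MLP applied to the embedding is omitted, and all function names are hypothetical.

```python
import math
import random

def sinusoidal_embedding(sigma, d, max_period=10000.0):
    """Transformer-style sinusoidal features of a scalar noise level (d even).

    The frequency schedule is an assumed convention, not taken from the paper.
    """
    half = d // 2
    freqs = [max_period ** (-k / half) for k in range(half)]
    return [math.sin(sigma * f) for f in freqs] + [math.cos(sigma * f) for f in freqs]

def add_noise(x, sigma_max, rng):
    """x: list of T patches (flat pixel lists).

    Returns the noisy patches, the sampled noise level sigma, and the
    scaled noise sigma * e that serves as the prediction target.
    """
    sigma = rng.uniform(0.0, sigma_max)
    noise = [[sigma * rng.gauss(0.0, 1.0) for _ in patch] for patch in x]
    noisy = [[a + e for a, e in zip(patch, eps)] for patch, eps in zip(x, noise)]
    return noisy, sigma, noise

def denoising_sq_error(noise, x_hat, mask):
    """Squared error between true and predicted noise on unmasked patches (mask == 0)."""
    total = 0.0
    for eps, pred, m in zip(noise, x_hat, mask):
        if m == 0:
            total += sum((e - p) ** 2 for e, p in zip(eps, pred))
    return total

rng = random.Random(0)
noisy, sigma, noise = add_noise([[0.0, 0.0], [1.0, 1.0]], 0.05, rng)
embedding = sinusoidal_embedding(sigma, 8)  # would be fed to the small MLP
```

The mask test here is the complement of the one used for reconstruction, so every decoder output position contributes to exactly one of the two losses.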

3.5. THE COMBINED OBJECTIVE FUNCTION

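The overall CAN objective trains the encoder and decoder to optimize three losses combined:

$$\mathcal{L}_{\text{CAN}} = \lambda_{\text{InfoNCE}} \mathcal{L}_{\text{InfoNCE}} + \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{denoise}} \mathcal{L}_{\text{denoise}}$$

where the weights satisfy $0 \le \lambda_{\text{InfoNCE}}, \lambda_{\text{rec}}, \lambda_{\text{denoise}}$ and $\lambda_{\text{InfoNCE}} + \lambda_{\text{rec}} + \lambda_{\text{denoise}} = 1$. In practice we parameterize the weights by eliminating one variable using the equality constraint, taking $\lambda_{\text{rec}} = (1 - \lambda_{\text{InfoNCE}}) \cdot \lambda$ and $\lambda_{\text{denoise}} = (1 - \lambda_{\text{InfoNCE}}) \cdot (1 - \lambda)$, where $0 \le \lambda \le 1$. This parameterization makes it easy to control the relative weighting between the two reconstruction losses $\mathcal{L}_{\text{rec}}, \mathcal{L}_{\text{denoise}}$ on the one hand, and the contrastive loss $\mathcal{L}_{\text{InfoNCE}}$ on the other. Empirically we find that performance is very robust to the choice of $\lambda$, and many choices of $\lambda_{\text{InfoNCE}}$ also work well (see Section 5).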
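The two-parameter weighting can be written as a short helper. The function names are hypothetical; the default contrastive weight of 0.03 is the best value reported in Section 5, while the default of 0.5 for the reconstruction/denoising split is an arbitrary assumption for illustration.

```python
def can_loss_weights(lambda_info_nce, lam):
    """Map (lambda_InfoNCE, lambda) to the three loss weights, which sum to 1."""
    assert 0.0 <= lambda_info_nce <= 1.0 and 0.0 <= lam <= 1.0
    lambda_rec = (1.0 - lambda_info_nce) * lam
    lambda_denoise = (1.0 - lambda_info_nce) * (1.0 - lam)
    return lambda_info_nce, lambda_rec, lambda_denoise

def can_loss(l_info_nce, l_rec, l_denoise, lambda_info_nce=0.03, lam=0.5):
    """Weighted sum of the three losses; lam=0.5 is an assumed default."""
    w_c, w_r, w_d = can_loss_weights(lambda_info_nce, lam)
    return w_c * l_info_nce + w_r * l_rec + w_d * l_denoise

weights = can_loss_weights(0.03, 0.5)
```

Setting `lam` to 1 switches off the denoising term and setting `lambda_info_nce` to 0 switches off the contrastive term, which is the mechanism the ablations use to remove a component.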

3.6. DISCUSSION ON EFFICIENCY

The goal of this work is to propose a conceptually minimal combined contrastive masked autoencoder approach, aiming to find better trade-offs between simplicity, efficiency, and performance. Consequently, we choose to omit a number of commonly used self-supervised learning design components. For instance, we do not use a momentum target network or multiple views (multi-crop), since they both increase memory requirements and run time. Even without these commonly used components, our minimal framework achieves very strong performance compared to prior work, and importantly improves performance over its contrastive and autoencoder constituent parts. We expect that a wide range of modifications, such as momentum target networks (He et al., 2020) and multi-crop (Caron et al., 2020), will improve performance further on top of the core method.

The linear probe performance of CAN is 74.8% using ViT-B, beating all masked image modelling methods, the best of which is CAE with 70.4% (Chen et al., 2022). CAN is only outperformed by MoCo-v3 and DINO, both of which use momentum encoders and two full image views, and in the case of DINO a further 10 multi-crop views. Note that the masked image column indicates whether a method uses one or more full image views as input to the model, and the no additional parameters column indicates whether a method relies on other parameters besides the main encoder, e.g., from a pre-trained tokenizer, or a momentum updated target encoder. We also report results for our MAE implementation, which approximately matches the original numbers reported by He et al. (2022), validating our MAE results on JFT-300M.

4.3. FEW-SHOT LEARNING

We use linear probes to evaluate the suitability of CAN for few-shot learning, following the protocol of Dosovitskiy et al. (2021a). We use the models pre-trained on JFT-300M for 5000 epochs whose ImageNet performance is recorded in Figure 1. Results in Figure 4 for few-shot transfer learning on 9 other datasets show that the superior performance on IN-1K translates to strong performance on other tasks. We also note that our 25-shot ViT-L models beat both the full-shot DnC and BYOL ResNet50 models (also trained for 5000 epochs on JFT-300M) on 6 out of 8 datasets (Tian et al., 2021). See Appendix A for many additional results for different training schedules and model sizes.

4.4. ROBUSTNESS TO DISTRIBUTION SHIFT

Finally, we consider the robustness of CAN to distribution shifts. Figure 5 reports results on the following 7 validation sets, which cover a large variety of distribution shifts: original IN-1K (Deng et al., 2009), IN-v2 (Recht et al., 2019), IN-ReaL (Beyer et al., 2020), IN-Adversarial (Hendrycks et al., 2021b), IN-Rendition (Hendrycks et al., 2021a), ObjectNet (Barbu et al., 2019). CAN performs favourably under both JFT-300M and IN-1K pre-training, beating SimCLR and MAE baselines in nearly all cases. See Appendix A for additional results.

5. HYPERPARAMETER ANALYSIS

We study the different components of CAN to better understand the effect of the different mechanisms, and to determine optimal parameter configurations. All ablations use ViT-B models trained for 100 epochs on IN-1K, unless explicitly stated otherwise. The best loss weights and noise level found in these experiments are used for the experiments in Section 4.

Table 5: Ablating CAN: We remove each of the three loss terms in CAN one by one.

Contrastive loss weight. We vary the weight $\lambda_{\text{InfoNCE}}$ that balances the contribution of the contrastive and reconstruction losses. Recall that larger $\lambda_{\text{InfoNCE}}$ places higher weight on the contrastive loss. Results in Figure 7 show that the best weight is $\lambda_{\text{InfoNCE}} = 0.03$, which approximately balances the magnitudes of the two terms (see Table 4).

Denoising loss weight and noise level. We study the noise level interval $[0, \sigma_{\max}]$ from which to sample input noise, and the weight $\lambda$ balancing the denoising and reconstruction losses. Results in Figure 7 show that the best maximum noise level is $\sigma_{\max} = 0.05$, and that similar performance is attained for a number of different weights on the denoising loss.

Ablating CAN: CAN comprises three components: (C) contrastive, (A) masked autoencoder, and (N) denoising losses. We ablate each of the three components in Table 5, setting the loss weight to zero to "remove" a component. We use ViT-B models pre-trained for 100 epochs. Removing any component leads to worse performance, with removal of the contrastive loss hurting the most.

6. DISCUSSION

We present CAN, a simple, efficient and scalable self-supervised method for visual representation learning. CAN combines ideas from contrastive learning, masked autoencoding, and diffusion denoising into a single high-performing method. Extensive empirical results show that CAN scales with minimal changes to large uncurated datasets, providing a significant boost over SimCLR and MAE methods on a wide range of downstream tasks and evaluations, including linear probes, few-shot, robustness, and finetuning. Our results suggest that contrasting and reconstruction are complementary principles that can mutually reinforce one another.

Few shot: Section 4.3 reports 10- and 25-shot results for ViT-L models pre-trained on JFT-300M for 5000 epochs. Here we report 1- and 5-shot results for the same models in Figure 10. We additionally show the full set of {1, 5, 10, 25}-shot results for ViT-L models pre-trained on JFT-300M for 800 and 1600 epochs (Figures 11 and 12 respectively), ViT-B models pre-trained on JFT-300M for 800 epochs (Figure 13), and ViT-L models pre-trained on IN-1K for 800 epochs (Figure 14). We make a number of observations.

B RUNTIME OF CAN COMPARED TO DNC

In the main paper we estimate that our method is significantly faster than DnC (Tian et al., 2021). We determined this approximate comparison from the following two pieces of information: 1) DnC reports that 3000 ImageNet epochs takes 29 hours on 512 TPUs for a ResNet-50 model (approximately 25M parameters), and 2) 3000 ImageNet epochs of CAN take 78 hours on 64 TPUs for a ViT-L model (approximately 300M parameters). We assume that runtime is inversely proportional to the number of TPUs. Under this assumption, we estimate that CAN would take approximately 10 hours to train with 512 TPUs, compared to the 29 hours reported by Tian et al. (2021) for a model with roughly 1/10th the number of parameters. We emphasize that this is far from an exact comparison and is only intended as a very approximate guide.
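The estimate follows from a one-line calculation, written out here under the stated assumption that runtime scales inversely with the number of TPUs (the symbol $t$ is introduced only for this illustration):

```latex
t_{\text{CAN}}(512\ \text{TPUs})
  \approx 78\ \text{h} \times \frac{64}{512}
  = 78\ \text{h} \times \frac{1}{8}
  = 9.75\ \text{h}
  \approx 10\ \text{h}
```

This ignores communication overheads and differences in hardware generation, which is why it should be read only as a rough guide.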

C HYPERPARAMETER SETTINGS

We list hyperparameters used for CAN pre-training in Table 6 and Table 7 . For preprocessing we closely follow SimCLR Chen et al. (2020b) . We use the same hyperparameters for SimCLR pre-



¹ Code will be released soon.



Figure 1: Left: CAN scales better than SimCLR since it uses masked inputs. Middle and right: CAN outperforms SimCLR and MAE on ImageNet linear probe and finetune evaluations for ViT-L models when pre-training on uncurated data such as JFT-300M.

Figure 3: Denoising: Both the encoded patches and the noise level σ are passed to the decoder by passing σ through an MLP, and adding the result to each embedded token.

Figure 5: Robustness: Evaluating performance under distribution shifts with respect to models finetuned on IN-1K. Validation performance of ViT-L models is reported on 7 different datasets.

Figure 6: CAN and SimCLR with different masking rates. ViT-B models are pre-trained for 100 epochs on IN-1K (left), and 800 epochs on JFT-300M (right).

Figure 9: Top: ViT-B models pre-trained on JFT-300M for 800 epochs, evaluated on 7 datasets with distribution shifts from IN-1K. Bottom: Comparison of our JFT-300M pre-trained ViT-B model to training ViT-B from scratch on IN-1K. We compare to standard supervised cross-entropy training with Mixup, and to PyramidAT (Herrmann et al., 2022), which uses an adversarial training method. CAN considerably outperforms supervised training, and beats PyramidAT in 6 out of 7 cases without requiring adversarial training.

Figure 14: Few shot: ViT-L models pre-trained on IN-1K for 800 epochs are evaluated on 9 few-shot learning tasks.

Figure 16: Few shot: ViT-L models pre-trained on IN-21K for 800 (IN-1K equivalent) epochs are evaluated on 9 few-shot learning tasks.

JFT-300M pre-training: Comparison to the state of the art on ImageNet linear probe. CAN outperforms all methods except DnC, which uses a complicated multi-stage training process. Computation is measured as ImageNet-equivalent epochs. Models are evaluated using linear probe and finetuning on IN-1K. All hyperparameters were tuned on IN-1K, besides learning rate and weight decay, which we cut by a factor of 4 and 2 respectively to stabilize training on JFT-300M. See Appendix C and Section 5 for details.

Results. Figure 1 compares CAN to SimCLR and MAE baselines, using ViT-L models for all three methods. CAN achieves a much better trade-off between efficiency (measured in FLOPs) and performance: CAN uses 41% fewer FLOPs than SimCLR and consistently outperforms SimCLR and MAE. When training ViT-L models for 5000 epochs, CAN achieves an IN-1K linear probe performance of 75.4%, compared to 71.8% for SimCLR and 64.1% for MAE.
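The "41% fewer FLOPs" figure is the same comparison as the "70% more FLOPs" figure quoted for SimCLR elsewhere in the paper, expressed relative to the other method. Writing $F$ for the FLOPs of each method (notation introduced here for illustration only):

```latex
F_{\text{SimCLR}} = 1.7\, F_{\text{CAN}}
\quad\Longrightarrow\quad
\frac{F_{\text{CAN}}}{F_{\text{SimCLR}}} = \frac{1}{1.7} \approx 0.59,
\qquad
1 - \frac{1}{1.7} \approx 0.41
```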

Pre-training on ImageNet-1K. † Our implementation. *Quoted from Chen et al. (2022).

Few-shot: ViT-L models pre-trained on JFT-300M for 5000 epochs are evaluated on 9 datasets in few-shot setting (10-shot and 25-shot). CAN outperforms MAE and SimCLR.

Figure 7: ViT-B models pre-trained on IN-1K for 100 epochs. Left: The best contrastive loss weight is small but non-zero. Middle: A wide range of $\sigma_{\max}$ values improve over no-noise. Right: Performance is not sensitive to the denoising loss weight.

Complementarity of contrastive and reconstruction losses. A key hypothesis motivating our work is that contrastive learning and masked autoencoder reconstruction may not only be compatible training objectives, but are complementary ones. Table 4 compares the final training values of the contrastive loss $\mathcal{L}_{\text{InfoNCE}}$ and reconstruction loss $\mathcal{L}_{\text{rec}}$ when jointly trained (i.e., CAN) with those obtained when optimizing only $\mathcal{L}_{\text{InfoNCE}}$ (SimCLR) or only $\mathcal{L}_{\text{rec}}$ (MAE). The results support the hypothesis: joint training achieves a lower loss on both objectives compared to individual training.

Complementary training: All methods use 50% masking for fair comparison. CAN training achieves lower training loss for both contrastive and reconstruction than individual training.

Masking rate. Figure 6 reports the behavior of CAN and SimCLR under different masking rates on IN-1K and JFT-300M pre-training (for JFT-300M we use 800 epochs). The performance of SimCLR decreases as the masking rate increases, suggesting that masking is not an effective data augmentation. In contrast, performance of CAN peaks at a non-zero masking rate, but at a much lower rate than the 75% used by MAE on IN-1K. This occurs since very low masking rates are preferred by the contrastive part of CAN, but severely damage the autoencoder part, which can learn trivial solutions. The considerable efficiency improvement from masking 50% of patches more than compensates for the small drop in performance for a fixed number of epochs.

Few shot: ViT-L models pre-trained on JFT-300M for 5000 epochs evaluated on 9 few-shot learning tasks. Results accompany the 10- and 25-shot results in Figure 4.

Figure 15: Robustness: ViT-L models pre-trained on IN-21K for 800 (IN-1K equivalent) epochs are first finetuned on IN-1K. The models are then evaluated on 7 test datasets with different distribution shifts from IN-1K.

A ADDITIONAL TRANSFER LEARNING RESULTS

We report additional results for few-shot learning and robustness.

Robustness: Section 4.4 reports robustness results for ViT-L models pre-trained on JFT-300M for 5000 epochs, and ViT-L models pre-trained on IN-1K for 800 epochs. In both cases we report the performance of the models after finetuning on IN-1K. Here we report the same robustness results for ViT-L models trained on JFT-300M for 1600 and 800 epochs (Figure 8), and ViT-B models pre-trained for 800 epochs (Figure 9). Figure 9 also compares our ViT-B model to ViT-B models trained from scratch on ImageNet. We find that our model is considerably more robust than training with cross-entropy and Mixup from scratch, and also outperforms PyramidAT (Herrmann et al., 2022), an adversarial training method that introduces significant overheads compared to standard cross-entropy training. We emphasize that here there are two differences in the training: a) the training algorithm itself, and b) the data seen by the model. Our model sees extra JFT-300M data not seen by the other two approaches. This means that the methods are not exactly comparable. It is, however, a realistic setting showing the benefits to robustness of pre-training on large datasets.

training. For MAE pre-training, we use the same hyperparameters as listed in He et al. (2022), except for the use of Glorot uniform initialization instead of LeCun initialization as done in He et al. (2022). We found that this provided better performance for our JAX-based MAE implementation. Table 10 lists the hyperparameters for finetuning evaluations. We use the same set of hyperparameters for finetuning each pre-training method, and for both ViT-B and ViT-L model sizes. For linear probing we list the hyperparameters in Table 11, for which we followed the settings in He et al. (2022). We use global average pooling of the final representation instead of the cls token.

MAE longer training: MAE pre-training for longer (5000 epochs) on JFT becomes unstable after about 500k steps (the training loss oscillates); this results in poorer fine-tuning performance. To overcome this, we decrease the base learning rate by 75% as shown in Table 9. Our model CAN, however, is more stable, and we use the same hyperparameters across different numbers of epochs.

Few shot training: For few-shot learning we use the same hyperparameters and pipeline as Dosovitskiy et al. (2021a). We use the same pre-processing as Kolesnikov et al. (2020). We use a base learning rate of 0.01 and train for 2500 steps, using an input resolution of 384 × 384.

Hardware details: We use TPU-v4 for all of our experiments. CAN on ViT-B uses 64 TPUs for a batch size of 4096. SimCLR, on the other hand, uses 128 TPUs for the same batch size, and is more compute intensive than CAN.

Decoder architecture: Our decoder architecture is the same as He et al. (2022). We use a standard ViT with a decoder depth of 8 and decoder width of 512. We use 16 heads and 2048 as the dimension of the MLP.

Projection head architecture: We use 2 hidden layers in our projection heads. Each layer has a Fully-Connected (FC) layer (dim 4096) followed by BatchNorm (momentum=0.9) followed by ReLU. After these 2 layers we have an FC layer which transforms the features to 128 dimensions. We apply contrastive learning on top of these 128-dimensional features.

JFT-300M specific hyperparameters: All hyperparameters were determined by training on IN-1K, and directly transferred to JFT-300M pre-training, with the exception of learning rate and weight decay, which we found needed to be lower for JFT-300M. For all methods we divided the learning rate by a factor of 4, and the weight decay by a factor of 2, except for MAE where we found that the original weight decay tuned on ImageNet worked better. Specifically, for CAN and SimCLR we used the following parameter choices: $\text{wd} = 0.1/2 = 0.05$ and $\text{lr} = 1.25 \times 10^{-4}/4 = 3.125 \times 10^{-5}$. For MAE we used $\text{lr} = 1.5 \times 10^{-4}/4 = 3.75 \times 10^{-5}$, and tried $\text{wd} = 0.05/2 = 0.025$, but found that the original $\text{wd} = 0.05$ worked better, so kept this value.

