IS SELF-SUPERVISED CONTRASTIVE LEARNING MORE ROBUST THAN SUPERVISED LEARNING?

Abstract

Figure 1: We conduct a series of robustness tests based on data distribution corruptions from micro to macro levels, to study the behavior of contrastive and supervised learning beyond accuracy. Our results reveal that contrastive learning is usually more robust than supervised learning to downstream corruptions (∆^D_CL < ∆^D_SL), while it shows the opposite behavior for pre-training pixel- and patch-level corruptions (∆^P_CL > ∆^P_SL) and pre-training dataset-level corruptions (∆^P_CL < ∆^P_SL), where ∆ is the accuracy drop from the uncorrupted setting.

1. INTRODUCTION

In recent years, self-supervised contrastive learning (CL) has demonstrated tremendous potential in learning generalizable representations from unlabeled datasets (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen & He, 2021; Zhong et al., 2021b). Current state-of-the-art CL algorithms learn representations from ImageNet (Deng et al., 2009) that match or even exceed the accuracy of their supervised learning (SL) counterparts on ImageNet and downstream tasks. However, beyond accuracy, little attention has been paid to comparing other behavioral differences between contrastive and supervised learning, and even less work investigates robustness during pre-training. Robustness is an important criterion for evaluating machine learning algorithms. For example, robustness to long-tail or noisy training data allows a learning algorithm to work well in a wide variety of imperfect real-world scenarios (Wang et al., 2017). Robustness of the model output across training iterations enables anytime early stopping (Hu et al., 2019) and smoother continual learning (Shen et al., 2020). Robustness to input corruptions at test time plays an important role in reliably deploying trained models in safety-critical applications, as signified by the existence of adversarial examples (Goodfellow et al., 2015; Salman et al., 2020) and the negative impact of domain shift (Zhao et al., 2019). In this paper, we investigate whether CL and SL behave robustly to data distribution changes. In particular, how do changes in data affect the behavior of each algorithm, and do CL and SL behave similarly? To this end, we design a wide spectrum of corruptions, shown in Figure 1, to alter the data distribution, and conduct comprehensive experiments with different backbones, CL algorithms, and datasets.
The corruptions are carefully selected to be multi-level, targeting both human-recognizable and unrecognizable structural information, and are rooted in prior literature: pixel-level corruptions distort the intensity distribution, patch-level shuffling corrupts spatial structure (Ge et al., 2021; Neyshabur et al., 2020; Zhang et al., 2017; Hendrycks & Dietterich, 2019), and dataset-level class imbalance (Liu et al., 2022; 2019; Samuel & Chechik, 2021) and GAN (generative adversarial network) synthesis (Jahanian et al., 2021) shift the overall distribution. Our main results consist of two sets of experiments. The first investigates the downstream robustness of pre-trained models to corruptions of downstream data. The second studies the robustness under pre-training data corruptions: when an algorithm's accuracy degrades heavily under some corruption, it suggests that the algorithm may leverage the corrupted information as a learning signal. Note that our work is inspired by Zhang et al. (2017) and Ribeiro et al. (2020) and follows a similar empirical exploratory analysis, rather than a regular adversarial robustness paradigm. We deliver a set of intriguing new discoveries. We generally observe that CL is consistently more robust than SL to downstream corruptions. Meanwhile, corrupted pre-training data leads to diverging observations: CL is more robust to dataset-level corruptions than SL, but much less so to pixel- and patch-level corruptions. Moreover, we discover that contrastive learning depends more heavily on spatial information during pre-training, such that a global patch shuffling corruption harms feature learning greatly. To understand why pre-trained CL models are more robust to downstream corruptions, we analyze the learning dynamics through feature space metrics and find that CL yields larger overall and steadily-increasing per-class feature uniformity and higher stability than SL. The instance-level CL objective might capture richer sets of features not limited to semantic classes; therefore, the per-class uniformity or intra-class variation is not compressed as hard as in SL. This allows CL models to generalize to unseen corrupted downstream data better than SL.
This hypothesis aligns well with several recent attempts to understand CL (Zhao et al., 2021; Chen et al., 2021a; Liu et al., 2022). An immediate consequence of our insight is an improvement to supervised pre-training: adding a uniformity regularization term to explicitly promote intra-class variance improves robustness to test-time data corruptions. As for CL's vulnerability to pre-training data corruptions such as patch shuffling, we speculate that CL is more dependent on the spatial structure of images, and the introduction of high-frequency noise undermines the long-scale spatial coherence of natural images. For example, under global patch shuffling, the random resized cropping used in CL is no longer a proper data augmentation. We verify this intuition by manipulating data pre-processing and analyzing attention maps. We find that corrupting after standard data augmentation recovers a substantial amount of robustness, making CL comparably robust to SL. We summarize our contributions as follows. (1) We design extensive distributional robustness tests to study the behavioral differences of CL and SL systematically. (2) We discover diverging robustness behaviors between CL and SL, and even among different CL algorithms. (3) We offer analyses and explanations for these observations, and show a simple way to improve the downstream robustness of supervised learning. We position our paper as an empirical study; we hope our findings can serve as an initial step toward fully understanding CL's behaviors beyond accuracy and inspire future studies to explore these aspects through theoretical analysis. There is a growing body of literature on understanding SSL. Wang & Liu (2021) decomposes the contrastive objective into alignment (between augmentations) and uniformity (across the entire feature space) terms. Uniformity can be thought of as an estimate of the feature entropy, which we use to study the feature space dynamics during training.
Wang & Isola (2020) draws a connection between uniformity and the temperature parameter in the contrastive loss, and finds that a good temperature can balance uniformity against tolerance of semantically similar examples. Zhao et al. (2021) discovers that SSL transferring better than SL can be due to better low- and mid-level features, and that the intra-class invariance objective in SL weakens transferability by causing more misalignment between pre-training and downstream tasks. Ericsson et al. (2021) studies the downstream task accuracy of a variety of pre-trained models and finds that SSL outperforms SL on many tasks. Cole et al. (2022) investigates the impact of pre-training data size, domain quality, and task granularity on downstream performance. Chen et al. (2021a) identifies three intriguing properties of CL: a generalized version of the loss, learning in the presence of multiple objects, and feature suppression induced by competing augmentations. Our work falls into the same line of research that attempts to understand SSL better; however, we investigate from the angle of comparing the robustness behaviors of SSL/CL and SL.

2. RELATED WORK

Robustness and Data Corruption. The success of learning algorithms is often measured by some form of task accuracy, such as the top-1 accuracy for image classification (Deng et al., 2009). There are many types of data corruptions in prior work. The most common, such as random resizing and cropping, flipping, and color jittering, appear as data augmentation in SL and SSL (He et al., 2016; 2020; Chen et al., 2020b); the learned representation is encouraged to be invariant to such corruptions. A recent work (Jahanian et al., 2021) studies generative models as an alternative data source for contrastive learning. They focus on the comparison with real data, while we emphasize the behavioral difference between SSL and SL in response to the generative data source. Feature backward-compatibility (Shen et al., 2020) is related to our stability analysis of feature dynamics. Recently, Goyal et al. (2021a) studies the effectiveness of SSL on uncurated, class-imbalanced data. Liu et al. (2022) also notices that SSL tends to be more robust to class imbalance than SL. We bring extra insights beyond theirs: we consider both pre-training and downstream robustness and compare CL vs. SL behaviors, while Goyal et al. (2021a) only focuses on downstream robustness and compares dataset scale. Our investigation suggests that pre-training behavior can be the opposite of downstream behavior. Liu et al. (2022) only studies class imbalance, whereas we also consider image structural corruptions.

3. METHOD

We define distributional robustness as robustness against various distribution shifts of input images induced by carefully-designed data corruptions, and we evaluate the distributional robustness of different algorithms by observing the impact of these corruptions. We refer to the behavior of a learning algorithm as how it learns representations and how that learning evolves throughout training. To what extent will such corruptions influence performance? Will there be consistent trends depending on the type of corruption? And will there be a behavioral difference between CL and SL?

3.1. ROBUSTNESS TESTS

The common way of using CL or SL models is through the pre-training and fine-tuning paradigm (Chen et al., 2020b; He et al., 2020; Zhong et al., 2021a;b). A neural backbone is pre-trained on a large-scale dataset such as ImageNet (Deng et al., 2009) or a composite dataset of images scraped from the Internet with mixed quality (Radford et al., 2021), and then transferred to initialize downstream models or used directly for inference. Therefore, it is crucial to consider the impact of data corruptions in both the pre-training and the downstream phases. Since data corruption destroys certain information by design, both settings are expected to yield degraded performance on the corrupted data. Specifically, we perform the following two complementary types of tests. Robustness Test I: Downstream data corruption. In this test, the pre-training algorithm is run on the clean version of the pre-training dataset. For a given downstream dataset, we evaluate the pre-trained model's accuracy on its original version and on various corrupted versions. This assesses the robustness of the algorithm through the pre-trained model's robustness behaviors. Robustness Test II: Pre-training data corruption. To assess an algorithm's robustness to pre-training data corruptions, we run the pre-training algorithm on the corrupted version of the dataset, and then evaluate the final model's accuracy on either the corrupted test set or the original test set. The test set can be in-domain (the same domain as the train set) or out-of-domain (a different domain from the train set). Robustness Metric. In both cases, robustness is measured by the degradation in accuracy caused by a given data corruption; an algorithm is more robust if the degradation is smaller. Denote D_original as the original dataset and D_corrupted as the corrupted dataset. For an algorithm Alg ∈ {CL, SL}, we define ∆(Alg) = (Acc(Alg, D_original) − Acc(Alg, D_corrupted)) / Acc(Alg, D_original).
The essential question we are asking is whether ∆(CL) is consistently larger or smaller than ∆(SL) across different data corruptions. We use two methods to obtain the test accuracy in the above equation. The first is linear evaluation, where we train a linear classifier on top of the learned representations on the train split and evaluate on the test split. The second is KNN evaluation following Wu et al. (2018), where the prediction for any test data point is the exponential-distance-weighted average of its K nearest neighbors in the train split, measured on normalized feature vectors. KNN evaluation effectively leverages a non-parametric classifier, so no classifier training is required.
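As a concrete sketch, the ∆ metric and the weighted KNN evaluation can be written as follows (a minimal NumPy sketch; the function names and the temperature value are our own choices for illustration, not the paper's exact implementation):

```python
import numpy as np

def robustness_delta(acc_original, acc_corrupted):
    """Relative accuracy drop: (Acc_orig - Acc_corr) / Acc_orig."""
    return (acc_original - acc_corrupted) / acc_original

def knn_predict(train_feats, train_labels, test_feats, k=5, temperature=0.07):
    """Weighted KNN on L2-normalized features in the spirit of Wu et al. (2018):
    each of the k nearest train neighbors votes with weight exp(sim / T)."""
    num_classes = int(train_labels.max()) + 1
    # L2-normalize so that the dot product equals cosine similarity
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sim = test_feats @ train_feats.T                 # (n_test, n_train)
    topk = np.argsort(-sim, axis=1)[:, :k]           # indices of k nearest neighbors
    preds = np.empty(len(test_feats), dtype=int)
    for i, idx in enumerate(topk):
        weights = np.exp(sim[i, idx] / temperature)  # exponential distance weighting
        votes = np.zeros(num_classes)
        for label, w in zip(train_labels[idx], weights):
            votes[label] += w
        preds[i] = votes.argmax()
    return preds
```

Because the classifier is non-parametric, evaluating a corrupted test set only requires extracting features, with no extra training.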

3.2. DATA CORRUPTION TYPES

There is a natural hierarchy of data corruptions, ranging conceptually from the micro level to the macro level. We describe our choices below (also illustrated in Figure 1). Note that our data corruption is different from data augmentation, which randomly applies a transform on a per-image basis. In our case, a fixed random transformation (e.g., the γ in gamma distortion or the permutation order in shuffling) is decided first and then applied consistently across all images; we effectively transform the entire dataset with the corruption method. We emphasize that our purpose is not only to study human-recognizable distortions, but to evaluate pre-training algorithms' behavior under various distortions. To this end, our collection of corruptions is designed to be representative and comprehensive: while some of them are practical (natural corruptions, imbalance), others are purposefully introduced to distort certain structural information (shuffling). A similar flavor of behavior study appears in Zhang et al. (2017). Pixel-Level Corruption. The pixel intensity distribution is altered, but neither the spatial layout of each image nor the overall data distribution is changed. Here, we deliberately pick gamma distortion and selected ImageNet-C corruptions (Hendrycks & Dietterich, 2019), since they are not part of the conventional data augmentation pipeline. We analyze the performance of the pre-trained models on the corrupted downstream tasks. For fair comparisons, we use the same data augmentation across methods whenever we need to train a model. Contrary to downstream corruptions, where CL demonstrates consistently higher robustness, whether CL is more robust than SL depends on the type of corruption during pre-training. Extensive experiments show that SL is more robust to pixel- and patch-level corruptions. Table 3 shows the impacts of gamma distortion and patch shuffling on CL and SL during pre-training. We train SL for 30 epochs and CL for 200 epochs (except for DINO, which is trained for 600 epochs) to reach comparable clean-data accuracy via linear evaluation.
The ∆ of SL due to gamma distortion is 2.4%, outperforming all tested CL methods. Under pre-training patch shuffling, all CL methods behave similarly and less robustly than SL, except in the L8x8 case where Sup and MoCo-v2 are comparable. We also extend to natural pixel-level corruptions for MoCo-v2 and SL.

4.3. CL IS MORE ROBUST TO PRE-TRAINING DATASET-LEVEL CORRUPTIONS

To investigate the pre-training distribution shift caused by synthesized data, we adopt a class-conditional StyleGAN2-ADA (Karras et al., 2020) trained on CIFAR-10 to generate a synthesized copy of the same size. We train MoCo-v2 for 200 epochs and SL for 50 epochs (both with ResNet-18 backbones) under different train/test data settings, reporting performance differences in Table 5. When training on the synthesized data and testing on the original CIFAR-10, MoCo-v2 only incurs ∆ = 2.58%, greatly outperforming the supervised method with ∆ = 8.44%. Evaluating on a GAN-synthesized test set yields similar observations: MoCo-v2 shows almost no drop while Sup drops 6%. Testing on out-of-domain CIFAR-100 shows the same behavior. Table 6 shows the impact of class imbalance. We use the ImageNet-LT (long-tail) dataset to simulate a real-world long-tail class distribution (Liu et al., 2019), and we sample a balanced subset of ImageNet, named ImageNet-UF (uniform), with the same size as ImageNet-LT. We train with a ResNet-50 backbone and compare the recognition accuracy on the ImageNet-LT validation split of linear classifiers fine-tuned on ImageNet-UF. Despite a gap between the baseline top-1 accuracies of MoCo-v2 and SL, we observe that the decline of MoCo resulting from pre-training on the long-tail rather than the uniform version is much smaller than that of SL. In fact, the MoCo performance appears insensitive to class balance or imbalance (the top-1 ∆ is only 0.71%). This is contrary to SL, which shows a larger drop. The difference is more salient when looking at the low-shot (< 20 images per class), medium-shot, and many-shot (> 100 images per class) accuracies separately. Supervised pre-training on the long-tail version sacrifices low-shot accuracy for higher many-shot accuracy, whereas MoCo-v2 pre-training shows insignificant differences among the shots. Our observation is consistent with a contemporary work (Liu et al., 2022).
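For illustration, an ImageNet-LT-style long-tail class profile can be sketched as below. The geometric-decay construction and the function name are our simplification (the actual ImageNet-LT is sampled differently), but it reproduces the max-1280 / min-5 head-to-tail profile:

```python
import numpy as np

def longtail_counts(num_classes, max_count=1280, min_count=5):
    """Per-class image counts decaying geometrically from max_count (head
    classes) to min_count (tail classes); a stand-in for ImageNet-LT-style
    imbalance, not the dataset's exact construction."""
    ratio = (min_count / max_count) ** (1.0 / (num_classes - 1))
    counts = np.round(max_count * ratio ** np.arange(num_classes))
    return np.maximum(counts, min_count).astype(int)
```

A uniform counterpart (in the spirit of ImageNet-UF) would simply give every class the same count so that the dataset totals match.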
We train a linear classifier on ImageNet-UF and report accuracies on ImageNet-LT-Val (20K images). Low-shot refers to classes with fewer than 20 images, many-shot to classes with more than 100, and med-shot to those in between. MoCo shows less sensitivity to pre-training data imbalance than Sup, with smaller ∆ and variance. Discussion. We try to balance diversity and setup consistency within our computation budget. Within each table, the setup is consistent, allowing comparison of SL and CL; across tables, we intentionally evaluate whether the observations generalize across backbones and datasets. For example, Tables 1 and 2 use the same corruptions with varying backbones; Table B.3 extends the observation from the small scale of Table 3 to a larger scale. The ∆ metric can be unreliable when the original uncorrupted accuracy differs too much across methods. We mitigate this by (1) controlling the original accuracies to be relatively close, and (2) testing multiple datasets, backbones, and corruption settings to draw consistent conclusions from more data points. The robustness discrepancy between CL (e.g., MoCo) and SL is not only reflected in the final trained models, but is in fact also rooted in the training process. To analyze how the feature space evolves during training, we measure the following three metrics: 1. Feature Semantic Fluctuation. We monitor the classification ability of the feature extractor via the accuracy of a KNN probe. We define the feature semantic fluctuation of class i as the total variation of the per-class accuracy of class i (as a function of epoch t), averaged over all epochs:

5.1. CL'S HIGHER DOWNSTREAM ROBUSTNESS IS RELATED TO A MORE UNIFORM AND STABLE FEATURE SPACE DURING TRAINING

TV_i = (1 / (T − 1)) Σ_{t=0}^{T−2} |Acc^{(i)}_{t+1} − Acc^{(i)}_t|. We further define the mean feature semantic fluctuation as the mean of TV_i over all classes. Larger semantic fluctuation indicates a less stable feature space. 2. Feature Uniformity. We measure the uniformity of all features, or of class-wise features, as the negative log-mean of Gaussian potentials of the normalized features: U(f_t, D) = −log E_{x_0, x_1 ∼ D} [exp(−2 ||f_t(x_0) − f_t(x_1)||_2^2)]. Here f_t is the network at epoch t, D is the dataset, and x_0 and x_1 are images sampled from the dataset. The use of this measure to study contrastive learning is exemplified in (Wang & Isola, 2020). Intuitively, a greater U means features are more uniformly distributed on the unit sphere, while a smaller value means more concentrated features. 3. Feature Distance. We also measure the average squared ℓ2 feature distance between two classes; a larger distance could mean better linear separability. Denoting D_i and D_j as the feature sets of two classes, the feature distance is calculated as d(f_t, D_i, D_j) = E_{x_0 ∼ D_i, x_1 ∼ D_j} ||f_t(x_0) − f_t(x_1)||_2^2. Note that if D_i = D_j, this measures the intra-class variance of class i. We train ResNet-18 (He et al., 2016) on the original CIFAR-10 (Krizhevsky et al., 2009) train split and measure the above metrics on the test split. Figure 3 shows the dynamics of feature uniformity and distances for MoCo-v2 (He et al., 2020; Chen et al., 2020d), supervised contrastive learning (SupCon) (Khosla et al., 2020), and supervised learning. We are interested in SupCon because it bridges CL and SL by leveraging a similar contrastive loss. As illustrated, the overall feature uniformity of MoCo-v2 (Chen et al., 2020d) is greater than 2.5 and approaches 3, while the overall uniformity of the SupCon and supervised methods ranges from 1.25 to 2.2. This means that features from CL methods are more uniformly distributed on the unit sphere.
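The three metrics above can be sketched as follows (NumPy, operating on precomputed feature matrices; we assume L2-normalized features throughout, and all names are ours):

```python
import numpy as np

def semantic_fluctuation(per_class_acc):
    """TV_i: mean absolute change of a per-class accuracy curve across epochs."""
    acc = np.asarray(per_class_acc, dtype=float)
    return np.abs(np.diff(acc)).mean()

def uniformity(feats):
    """U(f, D) = -log E exp(-2 ||f(x0) - f(x1)||^2) over distinct pairs of
    L2-normalized features (Wang & Isola, 2020)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sq = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(feats), k=1)  # distinct pairs only
    return -np.log(np.mean(np.exp(-2.0 * sq[iu])))

def class_distance(feats_i, feats_j):
    """d(f, D_i, D_j): mean squared L2 distance between two classes' normalized
    features; passing the same class twice measures its intra-class variance."""
    fi = feats_i / np.linalg.norm(feats_i, axis=1, keepdims=True)
    fj = feats_j / np.linalg.norm(feats_j, axis=1, keepdims=True)
    return np.sum((fi[:, None, :] - fj[None, :, :]) ** 2, axis=-1).mean()
```

The pairwise computations are O(n^2) in memory, so in practice one would evaluate them on a subsample of the test split at each probed epoch.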
By looking at the class-wise feature uniformity and distance, we notice that SL tends to compress (and perhaps over-compress) the features of each class. Figure 2 shows that the accuracy of a KNN probe during supervised learning also fluctuates more dramatically. We can interpret this as the classes competing with each other: SL cannot improve the performance on all classes at the same time, as CL methods do. We hypothesize that uniformity is the key to CL's higher downstream robustness because, intuitively, a more uniform feature space may capture richer characteristics of images and give the pre-trained model a higher chance of extracting useful representations from downstream images, corrupted or not. We test this hypothesis by checking whether SL can benefit from an extra uniformity-promoting loss term. Table 7 demonstrates that adding (or subtracting) the uniformity regularization produces a more (or less) uniform test feature space. This experiment suggests that we could improve SL by leveraging loss functions from CL and potentially get the best of both worlds.
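A uniformity-regularized supervised objective can be sketched as below. This is a hedged sketch: we assume a cross-entropy base loss and the pairwise-Gaussian uniformity estimate, with the convention that the optimizer minimizes the loss, so subtracting λ·U promotes uniformity; the ±0.01 weight follows Table 7, and all names are ours:

```python
import numpy as np

def cross_entropy(logits, labels):
    # numerically stable softmax cross-entropy
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def uniformity_term(feats):
    # U = -log E exp(-2 ||f(x0) - f(x1)||^2); larger U = more uniform features
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sq = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(feats), k=1)
    return -np.log(np.mean(np.exp(-2.0 * sq[iu])))

def regularized_loss(logits, labels, feats, lam=0.01):
    """Sup + lam*Unif setting: minimizing (CE - lam*U) maximizes uniformity U;
    a negative lam would instead suppress uniformity (Sup - lam*Unif)."""
    return cross_entropy(logits, labels) - lam * uniformity_term(feats)
```

In a training loop, `feats` would be the batch's backbone embeddings and `logits` the classifier outputs for the same batch.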

5.2. CL'S LOWER PRE-TRAINING ROBUSTNESS MAY RELATE TO HIGHER DEPENDENCY ON IMAGE SPATIAL COHERENCE

The diverging robustness behaviors of CL to pre-training corruptions can stem from its higher dependence on image spatial structure. While little previous work examines corruptions during pre-training and the reliance on spatial information, we hypothesize that a high-frequency corruption signal applied globally to the data harms long-scale coherence. This effect is intuitive: the authentic spatial information is eroded, and its weighted importance decreases as corruptive information is introduced. Table B.8 demonstrates how shuffling interferes with data augmentation in the CIFAR-10 pre-training case. While standard shuffling produces the largest avg ∆ = 22.0%, reversing the order of corruption and augmentation greatly ameliorates the ∆ of CL, yielding robustness comparable to SL with ∆ = 6.0%. When we perform augmentations such as random resized cropping after shuffling, we may select crop windows that capture pieces from different shuffled patches and thus do not follow natural image statistics. To better view the destructive effect, Figure C.4 shows the attention maps under global shuffling compared to the original. Contrastive pre-training with shuffled data leads to sparser and less accurate attention maps and essentially fails to learn good representations, which corroborates CL's worse robustness. This also hints that contrastive learning is not fully general: on certain types of images it fails. How to design general CL algorithms that work on all kinds of images remains an interesting question.
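The ordering experiment can be sketched as follows (a simplified stand-in: plain random cropping instead of the full random-resized-crop pipeline, and all names are ours):

```python
import numpy as np

def random_crop(img, size, rng):
    """Simplified stand-in for random resized cropping: a size x size window."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def corrupt_then_augment(img, corrupt, size, rng):
    """Standard setting: crops can straddle shuffled patch borders, yielding
    views that fall outside natural image statistics."""
    return random_crop(corrupt(img), size, rng)

def augment_then_corrupt(img, corrupt, size, rng):
    """Reversed order: cropping sees the intact image first, so each view keeps
    locally coherent content before corruption is applied."""
    return corrupt(random_crop(img, size, rng))
```

With an identity corruption the two orders coincide; with patch shuffling, the first order is the one that damages CL pre-training most.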

6. CONCLUSION

Our paper systematically studies the distributional robustness of CL and SL through a diverse set of multi-level data corruptions. We discover interesting robustness behaviors of CL under different corruptions. Our analysis of the feature space suggests that uniformity might be the key to higher downstream robustness, while our analysis of the augmentation process and attention maps discloses contrastive learning's high dependence on spatial information. Our results favor the current use of CL, or a combination of CL and SL, in visual representation learning, and call for more research into understanding the behavior and learning mechanism of CL.

B.2 PRE-TRAINING ROBUSTNESS TEST WITH LONGER EPOCHS

In Table 3 of the main paper, we mostly report results of short pre-training schedules (Sup 30 epochs and CL 200 epochs) in order to make the baseline results comparable. We report CIFAR-10 results with longer training epochs in Table B.2, and we use different optimizers during fine-tuning, e.g., AdamW (Loshchilov & Hutter, 2019) for some methods. All models are fine-tuned for 10 epochs. We find that this strategy of using different optimizers makes the baseline results on original images comparable across methods. We note that fine-tuning drastically improves the accuracy on downstream datasets, while the general observation that CL methods are more robust to downstream corruption than SL still holds, except for BarlowTwins, which is slightly worse than SL. Another interesting observation here is that different CL methods yield different robustness behaviors, although they all perform some form of contrastive learning and have similar baseline accuracies.

B.6 VARIANCE OF PRE-TRAINING RESULTS

We repeat MoCo-v2 on the original CIFAR-10 (200 epochs) three times: the KNN evaluation mean and std are 82.44 ± 0.18. Repeating MoCo-v2 on the global 8x8 shuffling-corrupted CIFAR-10 gives a KNN evaluation mean and std of 59.24 ± 0.40. The linear evaluation variance is similar. This randomness is of a smaller order than the gap between the MoCo and Sup results.

B.7 PRE-TRAIN ON CORRUPTED CIFAR-10, BUT TEST ON UNCORRUPTED IMAGES

In the main paper, we show results when both the pre-training and evaluation datasets are corrupted in the same consistent way. In the following Table B.7, we report accuracy numbers obtained from KNN evaluation on the original uncorrupted images. Since these models are pre-trained on the pixel- or patch-level corrupted dataset, the results reflect the transfer capability of the corrupted pre-trained models back to the original data distribution.




Self-Supervised Learning (SSL) and Contrastive Learning (CL). Remarkable progress has been made in self-supervised representation learning from unlabeled datasets (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen & He, 2021). This paper focuses on a particular kind of SSL algorithm, contrastive learning, which learns augmentation invariance with a Siamese network. To prevent trivial solutions, contrastive learning pushes negative examples apart (MoCo (He et al., 2020; Chen et al., 2020d; 2021b), SimCLR (Chen et al., 2020b;c)), makes use of a stop-gradient operation or an asymmetric predictor without using negatives (BYOL (Grill et al., 2020), SimSiam (Chen & He, 2021), DINO (Caron et al., 2021)), or leverages redundancy reduction (BarlowTwins (Zbontar et al., 2021)) and clustering (DeepCluster-v2 and SwAV (Caron et al., 2020)). In addition to augmentation invariance, generative pre-training (Ramesh et al., 2021; Bao et al., 2022; He et al., 2022) and visual-language pre-training (Radford et al., 2021) are promising ways to learn transferable representations.

Figure 2: Class-wise test accuracy of MoCo and SL on original CIFAR-10 during training. MoCo has more steady class-wise accuracy curves and smaller mean feature semantic fluctuation (T V) than SL.

Figure 3: Above: Solid black line -uniformity of the overall feature space. Dashed lines -class-wise feature uniformities of the 10 classes. While the overall uniformity of all methods grows, the uniformity of each class of Sup or SupCon is shrinking as training progresses. In the end, the overall uniformity of MoCo is the largest. Below: Solid black line -d(ft, D0, D0), i.e., the intraclass variance of class 0. Dashed lines -feature distances between Di(i ̸ = 0) and D0. The intra-class variance behavior of MoCo (increasing) is the opposite to that of Sup or SupCon (decreasing).

Figure B.1: H-divergence between the original dataset and the corrupted dataset, as measured by training a simple network to distinguish them.

Figure C.2: GradCAM on corrupted versions of a dog image of sup/MoCo models trained under 7 corruptions.

Figure C.3: GradCAM on corrupted versions of a bird image of sup/MoCo models trained under 7 corruptions.



Hendrycks & Dietterich (2019) proposes a set of corruptions complementary to ours. Block shuffling (our global image shuffling) has been used to study what is transferred in transfer learning (Neyshabur et al., 2020) and as negative views with diminished semantics in contrastive learning (Ge et al., 2021). Cole et al. (2022) tampers with data quality in SimCLR and SL training via salt-and-pepper noise, JPEG compression, resizing, and downsampling, and tests on clean data. We use a broader set of data corruptions and test on the corrupted data as well.

• Gamma distortion: gamma distortion remaps each RGB pixel intensity (∈ [0, 255]) according to x → ⌊255 × (x/255)^γ⌋, where γ > 0 is a tunable parameter. A larger or smaller γ shifts the intensities darker or brighter, respectively. Due to quantization error, part of the intensity information is lost in the process.
• ImageNet-C: ImageNet-C (Hendrycks & Dietterich, 2019) focuses on natural, human-recognizable corruptions such as noise, blurring, and weather effects. We pick shot noise, defocus blur, and JPEG compression in our pre-training robustness experiments.
Patch-Level Corruption. Inspired by Zhang et al. (2017), we consider random patch shuffling. Note that patch shuffling is not commonly used in the standard augmentation pipeline; we are curious about what behaviors CL and SL will exhibit when patch shuffling destroys certain structural coherence.
Dataset-Level Corruption. We here consider corruptions happening at the level of the whole dataset distribution, as the previous two corruptions only change the images but not the overall distribution. Real-world data often follows a long-tail distribution, where a few common semantic classes have many examples while many tail classes have few (Kang et al., 2020; Samuel & Chechik, 2021). However, benchmark datasets such as CIFAR and ImageNet are curated and class-balanced. We consider the widely-used variant of ImageNet, ImageNet-LT (long-tail) (Liu et al., 2019), with at most 1280 and at least 5 images per class. For comparison, we construct ImageNet-UF (uniform), a class-balanced subset of ImageNet containing the same number of images as ImageNet-LT (115K). We test whether moving pre-training from ImageNet-UF to ImageNet-LT leads to different behaviors between CL and SL.
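The gamma distortion and patch shuffling corruptions above can be sketched as follows (NumPy; one fixed γ, or one fixed patch permutation drawn once, is applied to every image in the dataset, as described above; function names are ours):

```python
import numpy as np

def gamma_distort(img, gamma):
    """x -> floor(255 * (x/255)**gamma) on uint8 intensities; the same fixed
    gamma is shared by the entire dataset."""
    x = np.asarray(img, dtype=np.float64)
    return np.floor(255.0 * (x / 255.0) ** gamma).astype(np.uint8)

def patch_shuffle(img, patch, perm):
    """Split an HxW(xC) image into non-overlapping patch x patch tiles and
    reorder them by one fixed permutation shared by the entire dataset."""
    h, w = img.shape[:2]
    gh, gw = h // patch, w // patch
    tiles = [img[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
             for r in range(gh) for c in range(gw)]
    out = np.empty_like(img)
    for k, src in enumerate(perm):
        r, c = divmod(k, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = tiles[src]
    return out
```

A fixed `perm` would be drawn once, e.g. with `np.random.default_rng(0).permutation(gh * gw)`, and then reused for every image so the whole dataset is transformed consistently.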

Robustness Test I: downstream pixel- and patch-level corruptions with ResNet-50 backbone. Models pre-trained on the original ImageNet are downloaded from the corresponding official websites ('IN Acc': reference ImageNet validation accuracy). We consider 5 downstream datasets. For each dataset, we report the averages of 6 corruption settings: gamma distortion γ = {0.2, 5}, and global and local shuffling (p = {4, image size/4}). The image size is 32 for C-10/100, 96 for STL-10, and 256 for the rest; the corrupted images are resized to 224 as input to the network. We compute the KNN accuracy (K=50 for C-10/100 and STL-10, K=5 for the others) on corrupted test sets and report ∆ relative to the uncorrupted versions. Avg ∆ is the average over the 5 datasets (darker shades indicate larger drops). This table only shows ∆; please refer to Appendix B.4 for more detailed accuracies. Contrastive learning models generally show lower accuracy drops and therefore higher downstream robustness than supervised models.

Robustness Test I: downstream pixel- and patch-level corruptions with ViT backbone. We show KNN accuracies and the ∆'s on three datasets. Similar to Table 1, ViT CL models are also more robust than the two SL models, especially to gamma distortion. The generative method, MAE (He et al., 2022), is slightly more robust than CL to patch shuffling on CIFAR, but inferior on STL10 and more vulnerable to gamma distortion. We include the average ∆ for each algorithm across all datasets for a clearer comparison.
DeiT (Sup): CIFAR10 94.23 → 71.42 (24.2%), 82.37 (12.6%), 64.09 (32.0%), 52.58 (44.2%), 52.54 (44.2%), 59.63 (36.7%), avg ∆ 32.3%; CIFAR100 79.86 → 48.70 (39.0%), 60.87 (23.8%), 40.95 (48.7%), 29.84 (62.6%), 28.91 (63.8%), 35.31 (55.8%), avg ∆ 49.0%; STL10 98.64 → 97.58 (1.1%), 98.01 (0.6%), 92.92 (5.8%), 46.99 (52.4%), 45.60 (53.8%), 73.22 (25.8%), avg ∆ 23.3%; alg avg ∆ 36.1%.
DINO: CIFAR10 95.37 → 90.66 (4.9%), 92.78 (2.7%), 73.24 (23.2%), 59.48 (37.6%), 53.10 (44.3%), 59.65 (37.5%), avg ∆ 25.0%; CIFAR100 78.23 → 68.98 (11.8%), 73.00 (6.7%), 49.86 (36.3%), 34.81 (55.5%), 29.49 (62.3%), 36.12 (53.8%), avg ∆ 37.7%; STL10 98.91 → 98.31 (0.6%), 98.17 (0.7%), 95.30 (3.7%), 50.36 (49.1%), 52.35 (47.1%), 79.96 (19.2%), avg ∆ 20.1%; alg avg ∆ 28.7%.
Another model: CIFAR10 96.68 → 92.85 (4.0%), 94.65 (2.1%), 77.99 (19.3%), 64.63 (33.2%), 60.79 (37.1%), 68.04 (29.6%), avg ∆ 20.9% (remaining rows in Appendix B.4, B.5). With ResNet-50, we notice that SimSiam, SwAV, and BarlowTwins behave slightly more robustly than the others.

Robustness Test II: pre-training pixel- and patch-level corruptions of CIFAR10 with ResNet18, and of full ImageNet with ResNet50. We use linear evaluation. We find that SL is more robust than CL in this scenario: while CL methods incur an average ∆ of about 20%, SL achieves 16.7% on CIFAR10 and 7.9% on ImageNet, lower than the best CL methods here (MoCo v2 and BYOL).
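For reference, linear evaluation trains only a linear classifier on top of the frozen backbone features. A minimal full-batch sketch is below; the paper would use SGD with a schedule over mini-batches, so the optimizer details here are simplifications.

```python
import numpy as np

def linear_probe(feats, labels, num_classes, lr=0.5, epochs=200):
    """Linear evaluation: fit a softmax classifier on frozen features
    via full-batch gradient descent on the cross-entropy loss."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(loss)/d(logits)
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

Unlike KNN evaluation, the probe can re-weight feature dimensions, so it measures linear separability of the representation rather than raw cluster structure.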

Robustness Test II: pre-training pixel-level natural corruptions of CIFAR10 with a ResNet18 backbone, and of full ImageNet (Deng et al., 2009) with a ResNet50 backbone, following (Hendrycks & Dietterich, 2019). We select MoCo-v2 (Chen et al., 2020d) to compare with SL under linear evaluation, and we pick shot noise, defocus blur, and JPEG compression as natural corruptions. On both datasets, SL achieves a lower average ∆, consistent with the results for unnatural corruptions during pre-training.
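Of the three natural corruptions, shot noise has the simplest form: ImageNet-C style shot noise treats pixel intensities as Poisson photon rates. A sketch is below; the specific severity constant is an illustrative value, as ImageNet-C uses a fixed set of per-severity constants.

```python
import numpy as np

def shot_noise(img, severity_scale=25.0, rng=None):
    """ImageNet-C style shot (Poisson) noise: pixel values in [0, 1]
    are scaled to photon counts, sampled from a Poisson distribution,
    and rescaled.  A lower severity_scale means a noisier image."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.clip(img, 0.0, 1.0)
    return np.clip(rng.poisson(x * severity_scale) / severity_scale, 0.0, 1.0)
```

The noise is signal-dependent (brighter pixels get larger absolute noise), which is what distinguishes it from additive Gaussian noise.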



Robustness Test II: pre-training on synthesized data. C10/C100 refer to CIFAR-10/100. Interestingly, at the absolute scale, MoCo shows higher downstream transfer accuracy to CIFAR-100 than SL, even though the 10 pre-training classes are only a small subset of the CIFAR-100 classes.

Robustness Test II: pre-training class imbalance. We compare MoCo and SL on ImageNet-LT (long-tail).

Uniformity regularization directly influences supervised pre-training's downstream robustness (ResNet-18, CIFAR-10, 200 epochs). Adding the uniformity term leads to higher KNN evaluation accuracy on corrupted data with no loss on original accuracy; subtracting it leads to the opposite. Each cell shows (Acc, Unif):

Method          | Original    | Corruption 1 | Corruption 2 | Corruption 3
Sup             | 94.18, 1.98 | 72.85, 1.64  | 37.85, 0.98  | 39.70, 0.90
Sup + 0.01 Unif | 94.21, 2.69 | 74.47, 2.03  | 42.22, 1.11  | 44.34, 1.30
Sup − 0.01 Unif | 94.56, 1.12 | 71.50, 0.77  | 36.15, 0.41  | 37.88, 0.46

The feature distance is calculated as:
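A minimal sketch of the uniformity measure is given below, assuming the commonly used formulation of Wang & Isola (2020), log-mean-exp of negative squared pairwise distances on the hypersphere; the sign convention and the t = 2 temperature are assumptions, and in training this quantity would be weighted by ±0.01 and added to the supervised loss.

```python
import numpy as np

def uniformity(z, t=2.0):
    """Uniformity of features z with shape (n, d), following the
    Wang & Isola (2020) formulation: log mean_{i<j} exp(-t * ||z_i - z_j||^2)
    on l2-normalized features.  More negative = more uniform."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)   # pairwise ||.||^2
    iu = np.triu_indices(len(z), k=1)                     # pairs i < j
    return float(np.log(np.mean(np.exp(-t * sq[iu]))))
```

A fully collapsed representation (all features identical) gives a value of 0, the worst case, while well-spread features give increasingly negative values, which matches the intuition that uniformity counteracts feature collapse.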

Table B.2. Training longer does not change our observation that MoCo appears less robust than SL to patch- and pixel-level corruptions during pre-training on this dataset.

Table B.2: Pre-training robustness: Sup 50ep vs. MoCo-v2 400ep, ResNet-18, CIFAR-10.

Table B.3: Robustness Test II: pre-training pixel- and patch-level corruptions of ImageNet100. We compare MoCo-v2 and SL trained on corrupted ImageNet100, a 100-class subset of ImageNet that is substantially larger than CIFAR. SL still shows higher robustness to our pixel-level and patch-level corruptions, in agreement with Table 3.

Table B.4 shows the detailed accuracy numbers used to compute the summary statistics in Table 1 of the main paper.

The numbers in the main paper and in Table B.4 above are generated with the KNN evaluation protocol. We also experiment with full fine-tuning on the downstream datasets; the results are in Table B.5. Since different pre-trained checkpoints are optimized with different optimizers (SGD

A ADDITIONAL IMPLEMENTATION DETAILS

ImageNet-LT/UF are the long-tail and uniformly-subsampled versions of ImageNet (Deng et al., 2009). ImageNet-100 is a 100-class subset of the full ImageNet-1K. We mainly list the Sup and MoCo-v2 (Chen et al., 2020d) hyper-parameters here; the other CL methods follow the hyper-parameter values recommended in the Solo-Learn package (da Costa et al., 2022).

Table B.1: We find that the MoCo-v3 degradation is larger with patch shuffling but smaller with gamma distortion. Interestingly, the impact of patch shuffling is much smaller than for a CNN (despite the Orig performance gap between ViT and CNN). We suspect that this is due to the unique patching and attention structure of ViT: setting data augmentation aside, with the right patch size, shuffling within a small patch does not affect ViT's learning much, and the global ordering of patches also matters little, because of the learned positional embeddings and global attention.

We also evaluate transferring the pre-trained representation from corrupted data to original data. We find that the trend is similar to evaluating on corrupted data: Sup appears more robust.

We follow Lemma 2 from the H-divergence paper (Ben-David et al., 2010) and implement the objective using PyTorch. Since the objective is straightforward and a convolutional network converges very quickly to zero loss (and an H-divergence of 2), we adopt a multilayer perceptron (MLP) with a sigmoid output to observe the training progress and the differences between data corruptions. We select two strengths for each of our proposed data corruptions and observe that stronger corruptions are indeed proportionally farther from the original dataset, i.e., they have higher H-divergence. The corruptions selected from ImageNet-C (Hendrycks & Dietterich, 2019) cannot be distinguished well by our MLP.
We could resolve this with a deeper convolutional network and longer training, but it suffices to say that ImageNet-C provides mild corruptions, which also corresponds to the smaller performance drops shown in Table 4. We do not evaluate dataset-level corruptions, since class imbalance already changes the class distribution and the GAN-synthesized dataset is trained to minimize divergence from the original. To empirically verify the different dynamics of the feature space, we adopt a few metrics to evaluate feature distance and uniformity, and quantify them for CL and SL models at each epoch to track progress throughout pre-training, as shown in Figure 2.
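The H-divergence proxy described above can be sketched in a few lines. Per Lemma 2 of Ben-David et al. (2010), the empirical H-divergence is estimated as 2(1 − 2·err), where err is the error of the best domain classifier separating the two sample sets. The text uses an MLP with a sigmoid output; in this sketch a logistic regression stands in as the hypothesis class.

```python
import numpy as np

def proxy_h_divergence(x_src, x_tgt, lr=0.1, epochs=500):
    """Empirical proxy for the H-divergence (Ben-David et al., 2010,
    Lemma 2): train a domain classifier to separate the two sample
    sets and return 2 * (1 - 2 * err), where err is its training error."""
    x = np.vstack([x_src, x_tgt])
    y = np.concatenate([np.zeros(len(x_src)), np.ones(len(x_tgt))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        z = np.clip(x @ w + b, -60.0, 60.0)     # logits, clipped for stability
        p = 1.0 / (1.0 + np.exp(-z))            # sigmoid
        grad = (p - y) / len(y)                 # gradient of the logistic loss
        w -= lr * x.T @ grad
        b -= lr * grad.sum()
    p = 1.0 / (1.0 + np.exp(-np.clip(x @ w + b, -60.0, 60.0)))
    err = np.mean((p > 0.5) != y)
    return 2.0 * (1.0 - 2.0 * err)
```

Identical distributions yield a classifier near chance level (divergence near 0), while easily separable datasets, like strongly corrupted vs. original images, drive the error toward 0 and the divergence toward its maximum of 2.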

