IS SELF-SUPERVISED CONTRASTIVE LEARNING MORE ROBUST THAN SUPERVISED LEARNING?




Abstract

Prior work on self-supervised contrastive learning has primarily focused on evaluating recognition accuracy, but has overlooked other behavioral aspects. In addition to accuracy, distributional robustness plays a critical role in the reliability of machine learning models. We design and conduct a series of robustness tests to quantify the behavioral differences between contrastive learning and supervised learning under downstream and pre-training data distribution changes. These tests leverage data corruptions at multiple levels, ranging from pixel-level distortion to patch-level shuffling to dataset-level distribution shift, including both natural and unnatural corruptions. Our tests unveil intriguing robustness behaviors of contrastive and supervised learning: while we generally observe that contrastive learning is more robust than supervised learning under downstream corruptions, we surprisingly discover the robustness vulnerability of contrastive learning under pixel- and patch-level corruptions during pre-training. Furthermore, we observe that contrastive learning depends more heavily on spatial image coherence during pre-training; e.g., it is particularly sensitive to global patch shuffling. We explain these results by connecting them to feature space uniformity and data augmentation. Our analysis has implications for improving the downstream robustness of supervised learning, and calls for more studies on understanding contrastive learning.

1. INTRODUCTION

In recent years, self-supervised contrastive learning (CL) has demonstrated tremendous potential in learning generalizable representations from unlabeled datasets (Chen et al., 2020b; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Chen & He, 2021; Zhong et al., 2021b). Current state-of-the-art CL algorithms learn representations from ImageNet (Deng et al., 2009) that match or even exceed the accuracy of their supervised learning (SL) counterparts on ImageNet and downstream tasks. However, beyond accuracy, little attention has been paid to comparing other behavioral differences between contrastive learning and supervised learning, and even less work investigates robustness during pre-training. Robustness is an important aspect in evaluating machine learning algorithms.

Our main results consist of two sets of experiments. The first set investigates the downstream robustness of pre-trained models to corruptions of downstream data. The second set studies robustness under pre-training data corruptions: when an algorithm's accuracy degrades substantially under some corruption, it suggests that the algorithm may leverage the corrupted information as a learning signal. Note that our work is inspired by Zhang et al. (2017) and Ribeiro et al. (2020) and follows a similar empirical exploratory analysis, rather than a standard adversarial robustness paradigm.

We deliver a set of intriguing new discoveries. We generally observe that CL is consistently more robust than SL to downstream corruptions. Meanwhile, contrastive learning on corrupted pre-training data leads to diverging observations: CL is more robust to dataset-level corruption than SL, but much less so to pixel- and patch-level corruptions. Moreover, we discover a higher dependence of contrastive learning on spatial information during pre-training, such that a global patch-shuffling corruption harms feature learning greatly.
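To make the global patch-shuffling corruption mentioned above concrete, the following is a minimal NumPy sketch. The function name, patch size, and implementation details are our assumptions for illustration, not the paper's exact protocol; the key property is that shuffling destroys long-range spatial coherence while preserving the pixel intensity distribution.

```python
import numpy as np

def global_patch_shuffle(image, patch_size, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and
    permute them globally. Spatial structure is destroyed, but the
    multiset of pixel values is unchanged."""
    rng = np.random.default_rng(rng)
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # Reshape into a (gh, gw, p, p, c) grid of patches.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch_size, patch_size, c)
    # Globally permute the patches.
    patches = patches[rng.permutation(gh * gw)]
    # Reassemble the shuffled patches into an image.
    out = patches.reshape(gh, gw, patch_size, patch_size, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```

Because the permutation is global rather than within local neighborhoods, even large crops taken afterwards no longer contain coherent object structure, which is precisely what random resized cropping in CL relies on.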
To understand why pre-trained CL models are more robust to downstream corruptions, we analyze the learning dynamics through feature space metrics and find that CL yields larger overall and steadily-increasing per-class feature uniformity and higher stability than SL. The instance-level CL objective might capture richer sets of features not limited to semantic classes; therefore, the per-class uniformity, or intra-class variation, is not compressed as aggressively as in SL. This allows CL models to generalize better than SL to unseen corrupted downstream data. This hypothesis aligns well with several recent attempts to understand CL (Zhao et al., 2021; Chen et al., 2021a; Liu et al., 2022). An immediate consequence of our insight is an improvement to supervised pre-training: adding a uniformity regularization term that explicitly promotes intra-class variance improves robustness to test-time data corruption.

As for CL's vulnerability to pre-training data corruptions such as patch shuffling, we speculate that CL is more dependent on the spatial structure of images, and that the introduction of high-frequency noise undermines the long-range spatial coherence of natural images. For example, with global patch shuffling, the random resized cropping used in CL is no longer a proper data augmentation. We verify this intuition by manipulating data pre-processing and analyzing attention maps. We find that corrupting after standard data augmentation recovers a substantial amount of robustness, making CL comparably robust to SL.

We summarize our contributions as follows. (1) We design extensive distributional robustness tests to systematically study the behavioral differences between CL and SL. (2) We discover diverging robustness behaviors between CL and SL, and even among different CL algorithms. (3) We offer analyses and explanations for these observations, and show a simple way to improve the downstream robustness of supervised learning. We position our paper as an empirical study.
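The per-class uniformity idea above can be sketched with the pairwise Gaussian-potential uniformity metric of Wang & Isola (2020); lower (more negative) values indicate features spread more uniformly on the unit hypersphere. The per-class regularizer below is a hypothetical illustration of how such a term could promote intra-class variance in supervised pre-training, not the paper's exact formulation.

```python
import numpy as np

def uniformity(z, t=2.0):
    """log E[exp(-t * ||zi - zj||^2)] over distinct pairs, with features
    projected to the unit hypersphere (Wang & Isola, 2020).
    0 means fully collapsed; more negative means more uniform."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq[iu]))))

def per_class_uniformity_penalty(z, labels, t=2.0):
    """Hypothetical regularizer: mean uniformity within each class.
    Adding a scaled version to the supervised loss discourages each
    class from collapsing to a single point in feature space."""
    vals = [uniformity(z[labels == c], t)
            for c in np.unique(labels) if np.sum(labels == c) > 1]
    return float(np.mean(vals))
```

In this sketch, a supervised objective would minimize `cross_entropy + lambda * per_class_uniformity_penalty(...)` for some small weight `lambda` (a hypothetical hyperparameter), so that intra-class features stay spread out rather than tightly clustered.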
We hope our findings can serve as an initial step to fully understand CL's behaviors beyond accuracy and inspire more future studies to explore such aspects through theoretical analysis.

2. RELATED WORK

Self-Supervised Learning (SSL) and Contrastive Learning (CL). Remarkable progress has been made in self-supervised representation learning from unlabeled datasets (Chen et al., 2020b; He 



Figure 1: We conduct a series of robustness tests based on data distribution corruptions from micro to macro levels, to study the behavior of contrastive and supervised learning beyond accuracy. Our results reveal that contrastive learning is usually more robust than supervised learning to downstream corruptions (∆^D_CL < ∆^D_SL), while showing the opposite behavior for pre-training pixel- and patch-level corruptions (∆^P_CL > ∆^P_SL) and pre-training dataset-level corruptions (∆^P_CL < ∆^P_SL), where ∆ denotes the accuracy drop from the uncorrupted setting.

Shen et al., 2020). Robustness to input corruptions at test time plays an important role in the reliable deployment of trained models in safety-critical applications, as signified by the existence of adversarial examples (Goodfellow et al., 2015; Salman et al., 2020) and the negative impact of domain shift (Zhao et al., 2019). In this paper, we investigate whether CL and SL behave robustly to data distribution changes. In particular, how do changes in data affect the behavior of each algorithm? And do CL and SL behave similarly? To this end, we design a wide spectrum of corruptions, as shown in Figure 1, to alter data distributions, and conduct comprehensive experiments with different backbones, CL algorithms, and datasets. The corruptions are carefully selected to be multi-level, targeting both human-recognizable and unrecognizable structural information, and are rooted in prior literature: pixel-level corruptions distort the intensity distribution, patch-level shuffling corrupts spatial structure (Ge et al., 2021;
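As an example of the pixel-level family described above, additive Gaussian noise distorts the intensity distribution while leaving spatial structure intact. This is a minimal sketch for illustration; the noise level and value range are our assumptions, not the paper's exact corruption parameters.

```python
import numpy as np

def gaussian_noise(image, sigma=0.1, rng=None):
    """Pixel-level corruption: add i.i.d. Gaussian noise to each pixel.
    Spatial structure (edges, object layout) is preserved, but the
    per-pixel intensity distribution is distorted.
    Assumes image values lie in [0, 1]."""
    rng = np.random.default_rng(rng)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```

Contrasting this with the patch-level shuffle highlights why the two families probe different information: noise attacks intensity statistics at fixed locations, while shuffling attacks locations at fixed intensity statistics.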

