ARCL: ENHANCING CONTRASTIVE LEARNING WITH AUGMENTATION-ROBUST REPRESENTATIONS

Abstract

Self-Supervised Learning (SSL) is a paradigm that leverages unlabeled data for model training. Empirical studies show that SSL can achieve promising performance under distribution shift, where the downstream and training distributions differ. However, the theoretical understanding of its transferability remains limited. In this paper, we develop a theoretical framework to analyze the transferability of self-supervised contrastive learning by investigating the impact of data augmentation. Our results reveal that the downstream performance of contrastive learning depends largely on the choice of data augmentation. Moreover, we show that contrastive learning fails to learn domain-invariant features, which limits its transferability. Based on these theoretical insights, we propose a novel method called Augmentation-robust Contrastive Learning (ArCL), which provably learns domain-invariant features and can be easily integrated with existing contrastive learning algorithms. We conduct experiments on several datasets and show that ArCL significantly improves the transferability of contrastive learning.

1. INTRODUCTION

A common assumption in designing machine learning algorithms is that training and test samples are drawn from the same distribution. However, this assumption may not hold in real-world applications, and algorithms may suffer from distribution shifts, where the training and test distributions differ. This issue has motivated a plethora of research in various settings, such as transfer learning, domain adaptation and domain generalization (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021). Different ways of characterizing the relationship between test and training distributions lead to different algorithms. Most of this literature studies the supervised setting, aiming to find features that capture some invariance across different distributions, under the assumption that such invariance also applies to test distributions (Peters et al., 2016; Rojas-Carulla et al., 2018; Arjovsky et al., 2019; Mahajan et al., 2021; Jin et al., 2020; Ye et al., 2021).

Self-Supervised Learning (SSL) has attracted great attention in many fields (He et al., 2020; Chen et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). It first learns a representation from a large amount of unlabeled training data, and then fine-tunes the learned encoder to obtain a final model on the downstream task. Due to this two-step nature, SSL is more likely to encounter the distribution shift issue, and exploring its transferability under distribution shifts has become an important topic. Some recent works study this issue empirically (Liu et al., 2021; Goyal et al., 2021; von Kügelgen et al., 2021; Wang et al., 2021b; Shi et al., 2022). However, the theoretical understanding is still limited, which also hinders the development of algorithms. In this paper, we study the transferability of self-supervised contrastive learning under distribution shift from a theoretical perspective.
In particular, we investigate which downstream distributions will result in good performance for the representation obtained by contrastive learning. We study this problem by deriving a connection between the contrastive loss and the downstream risk. Our main finding is that data augmentation is essential: contrastive learning provably performs well on downstream tasks whose distributions are close to the augmented training distribution. Moreover, the idea behind contrastive learning is to find representations that are invariant under data augmentation. This is similar to the domain-invariance based supervised learning methods, since applying each kind of augmentation to the training data can be viewed as inducing a specific domain. Unfortunately, from this perspective, we discover that contrastive learning fails to produce a domain-invariant representation, limiting its transferability. To address this issue, we propose a new method called Augmentation-robust Contrastive Learning (ArCL), which can be integrated with various widely used contrastive learning algorithms, such as SimCLR (Chen et al., 2020) and MoCo (He et al., 2020). In contrast to standard contrastive learning, ArCL forces the representation to align the two farthest positive samples, and thus provably learns domain-invariant representations. We conduct experiments on CIFAR10 and ImageNet, testing the representations learned by ArCL on various downstream tasks. Our experiments demonstrate that ArCL significantly improves the standard contrastive learning algorithms.
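The core distinction above can be made concrete: standard contrastive learning aligns an average pair of augmented views, while ArCL aligns the two farthest views, a worst-case objective. The toy NumPy sketch below illustrates this difference; the encoder, augmentation set, and helper names are ours for illustration and are not the paper's actual implementation.

```python
import numpy as np

def alignment_losses(f, x, augmentations):
    """All pairwise alignment losses ||f(A_i(x)) - f(A_j(x))||^2 over a
    set of sampled augmentations (hypothetical helper for illustration)."""
    views = [f(A(x)) for A in augmentations]
    losses = []
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            losses.append(np.sum((views[i] - views[j]) ** 2))
    return losses

def standard_alignment(f, x, augmentations):
    # Standard contrastive learning: align an *average* positive pair,
    # i.e. the expected alignment loss over augmentation pairs.
    return float(np.mean(alignment_losses(f, x, augmentations)))

def arcl_alignment(f, x, augmentations):
    # ArCL-style objective: align the *two farthest* views (sup over
    # augmentation pairs), which drives augmentation invariance.
    return float(np.max(alignment_losses(f, x, augmentations)))
```

For an encoder that is not augmentation-invariant, the worst-case loss strictly dominates the average one, so minimizing it forces invariance across all induced domains rather than on average.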

RELATED WORK

Distribution shift in supervised learning. The distribution shift problem has been studied extensively (Blanchard et al., 2011; Muandet et al., 2013; Wang et al., 2021a; Shen et al., 2021). Most works aim to learn a representation that performs well on different source domains simultaneously (Rojas-Carulla et al., 2018; Mahajan et al., 2021; Jin et al., 2020), following the idea of causal invariance (Peters et al., 2016; Arjovsky et al., 2019). Structural equation models are often assumed for theoretical analysis (von Kügelgen et al., 2021; Liu et al., 2020; Mahajan et al., 2021). Distributionally robust optimization directly optimizes a model's worst-case performance over some uncertainty set (Krueger et al., 2021; Sagawa et al., 2019; Duchi & Namkoong, 2021; Duchi et al., 2021). Stable learning (Shen et al., 2020; Kuang et al., 2020) learns a set of global sample weights that remove the confounding bias for all potential treatments in the data distribution. Disentangled representation learning (Bengio et al., 2013; Träuble et al., 2021; Kim & Mnih, 2018) aims to learn representations in which distinct and informative factors of variation in data are separated.

Theoretical understanding of contrastive learning. A number of recent works aim to theoretically explain the success of contrastive learning in IID settings. One line of work analyzes the mutual information between positive samples (Tian et al., 2020; Hjelm et al., 2018; Tschannen et al., 2019). Arora et al. (2019) directly analyze the generalization of the InfoNCE loss under the assumption that positive samples are drawn from the same latent class. In the same setting, Bao et al. (2022) establish an equivalence between InfoNCE and supervised loss and give sharper upper and lower bounds. Huang et al. (2021) take data augmentation into account and provide generalization bounds based on the nearest-neighbor classifier.

Contrastive learning under distribution shift. Shen et al. (2022) and HaoChen et al. (2022) study contrastive learning in unsupervised domain adaptation, where unlabeled target data are available. Shi et al. (2022) show that SSL is the most robust under distribution shift compared to autoencoders and supervised learning. Hu et al. (2022) improve the out-of-distribution performance of SSL from an SNE perspective. Other robust contrastive learning methods (Kim et al., 2020; Jiang et al., 2020) focus on adversarial robustness, whereas this paper focuses on distributional robustness.

2. PROBLEM FORMULATION

Given a set of unlabeled data where each sample X is drawn i.i.d. from a training distribution D on X ⊆ R^d, the goal of Self-Supervised Learning (SSL) is to learn an encoder f : X → R^m for different downstream tasks. Contrastive learning is a popular approach to SSL: it augments each sample X twice to obtain a positive pair (X_1, X_2), and then learns the encoder f by pulling the pair close and pushing random samples (also called negative samples) apart in the embedding space. Data augmentation is performed by applying a transformation A to the original data, where A is randomly selected from a transformation set A according to some distribution π. We use
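The pull-close/push-apart mechanism described above is typically implemented with the InfoNCE loss: for each positive pair, the other samples in the batch serve as negatives. The following is a minimal NumPy sketch under assumed conventions (l2-normalized embeddings, batch negatives, single anchor direction); real implementations such as SimCLR symmetrize over both views and use a learned projection head.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """InfoNCE loss for a batch of positive pairs (z1[i], z2[i]).

    z1, z2: (n, m) arrays of l2-normalized embeddings of two augmented
    views of the same n samples; row i of z2 is the positive for row i
    of z1, and the remaining rows act as negatives.
    """
    n = z1.shape[0]
    logits = z1 @ z2.T / temperature             # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the positive on the diagonal.
    return float(-log_prob[np.arange(n), np.arange(n)].mean())
```

Minimizing this loss pulls f(X_1) and f(X_2) together (large diagonal similarity) while pushing the embeddings of other samples apart, matching the description above.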

FUNDING

This work was partially done when Xuyang was visiting the Qing Yuan Research Institute.

