YOUR CONTRASTIVE LEARNING IS SECRETLY DOING STOCHASTIC NEIGHBOR EMBEDDING

Abstract

Contrastive learning, especially self-supervised contrastive learning (SSCL), has achieved great success in extracting powerful features from unlabeled data. In this work, we contribute to the theoretical understanding of SSCL and uncover its connection to the classic data visualization method, stochastic neighbor embedding (SNE) (Hinton & Roweis, 2002), whose goal is to preserve pairwise distances. From the perspective of preserving neighboring information, SSCL can be viewed as a special case of SNE with the input-space pairwise similarities specified by data augmentation. The established correspondence facilitates a deeper theoretical understanding of the features learned by SSCL, as well as methodological guidelines for practical improvement. Specifically, through the lens of SNE, we provide novel analysis of domain-agnostic augmentations, implicit bias, and the robustness of learned features. To illustrate the practical advantage, we demonstrate that the modifications from SNE to t-SNE (Van der Maaten & Hinton, 2008) can also be adopted in the SSCL setting, achieving significant improvement in both in-distribution and out-of-distribution generalization.

1. INTRODUCTION

Recently, contrastive learning, especially self-supervised contrastive learning (SSCL), has drawn massive attention, with many state-of-the-art models following this paradigm in both computer vision (He et al., 2020a; Chen et al., 2020a;b; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021) and natural language processing (Fang et al., 2020; Wu et al., 2020; Giorgi et al., 2020; Gao et al., 2021; Yan et al., 2021). In contrast to supervised learning, SSCL learns representations from a large amount of unlabeled data and artificially defined self-supervision signals, i.e., regarding the augmented views of a data sample as positive pairs and randomly sampled data as negative pairs. By enforcing the features of positive pairs to align and those of negative pairs to be distant, SSCL produces discriminative features with state-of-the-art performance on various downstream tasks.

Despite the empirical success, the theoretical understanding remains under-explored: how do the learned features depend on the data and the augmentation, how do the different components of SSCL work, and what are the implicit biases when there exist multiple empirical loss minimizers? For instance, SSCL methods are widely adopted for pretraining, where the learned feature mappings are reused for various downstream tasks that are usually out-of-distribution (OOD). The distribution shift poses great challenges for the feature learning process, with extra requirements on robustness and OOD generalization (Arjovsky et al., 2019; Krueger et al., 2021; Bai et al., 2021; He et al., 2020b; Zhao et al., 2023; Dong et al., 2022), which demands a deeper understanding of SSCL methods.

The goal of SSCL is to learn feature representations from data. For this problem, one classic method is SNE (Hinton & Roweis, 2002) and its various extensions; in particular, t-SNE (Van der Maaten & Hinton, 2008) has become the go-to choice for low-dimensional data visualization. Compared with SSCL, SNE is far better explored in terms of theoretical understanding (Arora et al., 2018; Linderman & Steinerberger, 2019; Cai & Ma, 2021). However, its empirical performance is not satisfactory, especially in the modern era where data are highly complex. Since both aim to learn feature representations, are there any deep connections between SSCL and SNE? Can SSCL take advantage of the theoretical soundness of SNE? Can SNE be revived in the modern era by incorporating SSCL?
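To make the structural resemblance concrete, the following is a minimal sketch (in NumPy; the function name, temperature value, and toy data are illustrative assumptions, not the paper's code) of the SimCLR-style InfoNCE loss. For each anchor, the loss is a cross-entropy between an augmentation-induced neighbor distribution P, which places all of its mass on the positive (augmented) pair, and a softmax distribution Q over embedding similarities; this is the same preserve-the-neighbor-distribution form that SNE optimizes, with the input-space similarities specified by data augmentation, as stated in the abstract.

import numpy as np

def info_nce(z1, z2, tau=0.5):
    """z1, z2: (n, d) L2-normalized embeddings of two augmented views of the same n samples."""
    z = np.concatenate([z1, z2], axis=0)                       # stack the two views: (2n, d)
    sim = z @ z.T / tau                                        # temperature-scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                             # a point is never its own neighbor
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # index of each anchor's augmented view
    log_q = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # row-wise log-softmax: log Q(j | i)
    # cross-entropy between P (all mass on the positive pair, induced by augmentation) and Q
    return -log_q[np.arange(2 * n), pos].mean()

# Toy usage with random, nearly aligned view pairs.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.1 * rng.normal(size=(8, 16))
z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
print(info_nce(z1, z2))

Under these assumptions, minimizing the loss pulls augmented views together and pushes randomly paired samples apart, which is exactly the alignment-and-repulsion behavior described above.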

