YOUR CONTRASTIVE LEARNING IS SECRETLY DOING STOCHASTIC NEIGHBOR EMBEDDING

Abstract

Contrastive learning, especially self-supervised contrastive learning (SSCL), has achieved great success in extracting powerful features from unlabeled data. In this work, we contribute to the theoretical understanding of SSCL and uncover its connection to the classic data visualization method, stochastic neighbor embedding (SNE) (Hinton & Roweis, 2002), whose goal is to preserve pairwise distances. From the perspective of preserving neighboring information, SSCL can be viewed as a special case of SNE with the input-space pairwise similarities specified by the data augmentation. The established correspondence facilitates a deeper theoretical understanding of the learned features of SSCL, as well as methodological guidelines for practical improvement. Specifically, through the lens of SNE, we provide novel analysis of domain-agnostic augmentations, implicit bias, and robustness of learned features. To illustrate the practical advantage, we demonstrate that the modifications from SNE to t-SNE (Van der Maaten & Hinton, 2008) can also be adopted in the SSCL setting, achieving significant improvement in both in-distribution and out-of-distribution generalization.

1. INTRODUCTION

Recently, contrastive learning, especially self-supervised contrastive learning (SSCL), has drawn massive attention, with many state-of-the-art models following this paradigm in both computer vision (He et al., 2020a; Chen et al., 2020a;b; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021) and natural language processing (Fang et al., 2020; Wu et al., 2020; Giorgi et al., 2020; Gao et al., 2021; Yan et al., 2021). In contrast to supervised learning, SSCL learns representations from a large number of unlabeled data and artificially defined self-supervision signals, i.e., regarding the augmented views of a data sample as positive pairs and randomly sampled data as negative pairs. By enforcing the features of positive pairs to align and those of negative pairs to be distant, SSCL produces discriminative features with state-of-the-art performance on various downstream tasks. Despite the empirical success, the theoretical understanding remains under-explored as to how the learned features depend on the data and the augmentation, how the different components of SSCL work, and what the implicit biases are when there exist multiple empirical loss minimizers. For instance, SSCL methods are widely adopted for pretraining, whose feature mappings are utilized for various downstream tasks that are usually out-of-distribution (OOD). The distribution shift poses great challenges for the feature learning process, with extra requirements for robustness and OOD generalization (Arjovsky et al., 2019; Krueger et al., 2021; Bai et al., 2021; He et al., 2020b; Zhao et al., 2023; Dong et al., 2022), which demands a deeper understanding of SSCL methods.

The goal of SSCL is to learn feature representations from data. For this problem, one classic method is SNE (Hinton & Roweis, 2002) and its various extensions. Notably, t-SNE (Van der Maaten & Hinton, 2008) has become the go-to choice for low-dimensional data visualization.
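For concreteness, the classic SNE pipeline can be sketched in a few lines of NumPy. The sketch below is our own simplified illustration (a single shared bandwidth instead of per-point bandwidths, and no gradient step): build Gaussian conditional similarities in the input space, build the analogous similarities from the features, and measure the mismatch by a KL divergence.

```python
import numpy as np

def conditional_similarities(X, sigma=1.0):
    """P[i, j] proportional to exp(-||x_i - x_j||^2 / (2 sigma^2)),
    normalized over j != i (a point is not its own neighbor)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude j == i
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def sne_objective(X, Z, sigma=1.0):
    """Sum over i of KL(P_i || Q_i), where Q is built from the features Z
    in the same way P is built from the inputs X."""
    P = conditional_similarities(X, sigma)
    Q = conditional_similarities(Z, sigma)
    mask = ~np.eye(len(X), dtype=bool)
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

An embedding whose pairwise similarities exactly reproduce those of the input drives the objective to zero; in practice one runs gradient descent on the low-dimensional features `Z`.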
Compared with SSCL, SNE is far better explored in terms of theoretical understanding (Arora et al., 2018; Linderman & Steinerberger, 2019; Cai & Ma, 2021). However, its empirical performance is not satisfactory, especially in the modern era where data are overly complicated. Since both aim to learn feature representations, are there any deep connections between SSCL and SNE? Can SSCL take advantage of the theoretical soundness of SNE? Can SNE be revived in the modern era by incorporating SSCL?

In this work, we give affirmative answers to the above questions and demonstrate how the connection to SNE can benefit the theoretical understanding of SSCL, as well as provide methodological guidelines for practical improvement. The main contributions are summarized below.

• We propose a novel perspective that interprets SSCL methods as a type of SNE method aiming to preserve pairwise similarities specified by the data augmentation.

• The discovered connection enables a deeper understanding of SSCL methods. We provide novel theoretical insights on domain-agnostic data augmentation, implicit bias, and OOD generalization. Specifically, we show that isotropic random noise augmentation induces $\ell_2$ similarity while mixup noise can potentially adapt to low-dimensional structures of the data; we investigate the implicit bias from the angle of order preservation and identify the connection between minimizing the expected Lipschitz constant of the SSCL feature map and SNE with a uniformity constraint; and we identify that the popular cosine similarity can be harmful for OOD generalization.

• Motivated by the SNE perspective, we propose several modifications to existing SSCL methods and demonstrate practical improvements. Besides a re-weighting scheme, we advocate dropping the spherical constraint for improved OOD performance and a t-SNE style matching for improved separation.
Through comprehensive numerical experiments, we show that the modified t-SimCLR outperforms the baseline with 90% fewer feature dimensions on CIFAR-10, and that t-MoCo-v2 pretrained on ImageNet significantly outperforms the baseline on various domain transfer and OOD tasks.
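To preview the correspondence in code, the following is a hedged sketch (our own illustrative implementation, using only cross-view negatives, not the exact objective of any cited method) of a SimCLR-style InfoNCE loss. It is written so that each softmax row plays the role of SNE's feature-space similarity $Q_{\cdot|i}$, while the implicit input-space similarity $P$ puts all of its mass on the augmented view of the same sample.

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Simplified InfoNCE over a batch of positive pairs (z1[i], z2[i]).
    Row i of the softmax over `logits` is the feature-space similarity
    Q_{.|i}; the target P is one-hot on the positive pair, so the loss
    is the cross-entropy -mean_i log Q_{i|i}, i.e. an SNE-style matching
    of Q to an augmentation-defined P."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # spherical constraint
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                             # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_q)))
```

As expected from the alignment-and-uniformity view, feeding two perfectly aligned views yields a lower loss than feeding unrelated samples as "positives".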

Notations. For a function $f: \Omega \to \mathbb{R}$, let $\|f\|_\infty = \sup_{x \in \Omega} |f(x)|$ and $\|f\|_p = (\int_\Omega |f(x)|^p \, dx)^{1/p}$. For a vector $x$, $\|x\|_p$ denotes its $p$-norm, for $1 \le p \le \infty$. $\mathbb{P}(A)$ is the probability of event $A$. For a random variable $z$, we use $P_z$ and $p_z$ to denote its probability distribution and density, respectively. Denote the Gaussian distribution by $N(\mu, \Sigma)$ and let $I_d$ be the $d \times d$ identity matrix. Let the dataset be $D_n = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, where each $x_i$ independently follows the distribution $P_x$. The goal of unsupervised representation learning is to find informative low-dimensional features $z_1, \dots, z_n \in \mathbb{R}^{d_z}$ of $D_n$, where $d_z$ is usually much smaller than $d$. We use $f(x)$ as the default notation for the feature mapping from $\mathbb{R}^d$ to $\mathbb{R}^{d_z}$, i.e., $z_i = f(x_i)$.

Stochastic neighbor embedding. SNE (Hinton & Roweis, 2002) is a powerful representation learning framework designed for visualizing high-dimensional data in low dimensions by preserving neighboring information. The training process can be conceptually decomposed into the following two steps: (1) calculate the pairwise similarity matrix $P \in \mathbb{R}^{n \times n}$ for $D_n$; (2) optimize the features $z_1, \dots, z_n$ such that their pairwise similarity matrix $Q \in \mathbb{R}^{n \times n}$ matches $P$. Under these general guidelines lie plentiful details. In Hinton & Roweis (2002), the pairwise similarity is modeled as the conditional probability of $x_j$ being a neighbor of $x_i$, which is specified by a Gaussian distribution centered at $x_i$, i.e., for $j \neq i$,
$$P_{j|i} = \frac{\exp(-\|x_i - x_j\|_2^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|_2^2 / 2\sigma_i^2)}, \tag{2.1}$$
where $\sigma_i$ is the bandwidth of the Gaussian centered at $x_i$. Similar conditional probabilities $Q_{j|i}$ can be defined on the feature space. When matching $Q$ to $P$, the chosen measure of discrepancy is the KL-divergence between the two conditional distributions, so the overall training objective for SNE is
$$\sum_i \mathrm{KL}(P_{\cdot|i} \,\|\, Q_{\cdot|i}) = \sum_i \sum_{j \neq i} P_{j|i} \log \frac{P_{j|i}}{Q_{j|i}}. \tag{2.2}$$

Significant improvements have been made to the classic SNE. Im et al. (2018) generalized the KL-divergence to $f$-divergences and found that different divergences favor different types of structure. Lu et al. (2019) proposed making $P$ doubly stochastic so that the features are less crowded. Most notably, t-SNE (Van der Maaten & Hinton, 2008) modified the pairwise similarity by considering the joint distribution rather than the conditional one, and utilizes a t-distribution instead of a Gaussian in the feature-space modeling. It is worth noting that SNE belongs to a large class of methods called manifold learning (Li et al., 2022). In this work, we specifically consider SNE. If no confusion arises, we use SNE to denote the specific work of Hinton & Roweis (2002) and this type of method in general, interchangeably.

Self-supervised contrastive learning. The key part of SSCL is the construction of positive pairs, usually referred to as different views of the same sample. For each $x_i$ in the training data, denote

