YOUR CONTRASTIVE LEARNING IS SECRETLY DOING STOCHASTIC NEIGHBOR EMBEDDING

Abstract

Contrastive learning, especially self-supervised contrastive learning (SSCL), has achieved great success in extracting powerful features from unlabeled data. In this work, we contribute to the theoretical understanding of SSCL and uncover its connection to the classic data visualization method, stochastic neighbor embedding (SNE) (Hinton & Roweis, 2002), whose goal is to preserve pairwise distances. From the perspective of preserving neighboring information, SSCL can be viewed as a special case of SNE with the input-space pairwise similarities specified by data augmentation. The established correspondence facilitates deeper theoretical understanding of the learned features of SSCL, as well as methodological guidelines for practical improvement. Specifically, through the lens of SNE, we provide novel analysis on domain-agnostic augmentations, implicit bias and robustness of learned features. To illustrate the practical advantage, we demonstrate that the modifications from SNE to t-SNE (Van der Maaten & Hinton, 2008) can also be adopted in the SSCL setting, achieving significant improvement in both in-distribution and out-of-distribution generalization.

1. INTRODUCTION

Recently, contrastive learning, especially self-supervised contrastive learning (SSCL), has drawn massive attention, with many state-of-the-art models following this paradigm in both computer vision (He et al., 2020a; Chen et al., 2020a;b; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021) and natural language processing (Fang et al., 2020; Wu et al., 2020; Giorgi et al., 2020; Gao et al., 2021; Yan et al., 2021). In contrast to supervised learning, SSCL learns the representation through a large number of unlabeled data and artificially defined self-supervision signals, i.e., regarding the augmented views of a data sample as positive pairs and randomly sampled data as negative pairs. By enforcing the features of positive pairs to align and those of negative pairs to be distant, SSCL produces discriminative features with state-of-the-art performance on various downstream tasks.

Despite the empirical success, the theoretical understanding is under-explored as to how the learned features depend on the data and augmentation, how the different components of SSCL work, and what the implicit biases are when there exist multiple empirical loss minimizers. For instance, SSCL methods are widely adopted for pretraining, whose feature mappings are to be utilized for various downstream tasks that are usually out-of-distribution (OOD). The distribution shift poses great challenges for the feature learning process, with extra requirements for robustness and OOD generalization (Arjovsky et al., 2019; Krueger et al., 2021; Bai et al., 2021; He et al., 2020b; Zhao et al., 2023; Dong et al., 2022), which demands a deeper understanding of SSCL methods.

The goal of SSCL is to learn feature representations from data. For this problem, one classic method is SNE (Hinton & Roweis, 2002) and its various extensions. In particular, t-SNE (Van der Maaten & Hinton, 2008) has become the go-to choice for low-dimensional data visualization.
Compared to SSCL, SNE is far better explored in terms of theoretical understanding (Arora et al., 2018; Linderman & Steinerberger, 2019; Cai & Ma, 2021). However, its empirical performance is not satisfactory, especially in the modern era where data are highly complex. Since both aim to learn feature representations, are there any deep connections between SSCL and SNE? Can SSCL take advantage of the theoretical soundness of SNE? Can SNE be revived in the modern era by incorporating SSCL? In this work, we give affirmative answers to the above questions and demonstrate how the connections to SNE can benefit the theoretical understanding of SSCL, as well as provide methodological guidelines for practical improvement. The main contributions are summarized below.

• We propose a novel perspective that interprets SSCL methods as a type of SNE method aiming to preserve pairwise similarities specified by the data augmentation.

• The discovered connection enables deeper understanding of SSCL methods. We provide novel theoretical insights on domain-agnostic data augmentation, implicit bias and OOD generalization. Specifically, we show that isotropic random noise augmentation induces the l_2 similarity while mixup noise can potentially adapt to low-dimensional structures of the data; we investigate the implicit bias from the angle of order preservation and identify the connection between minimizing the expected Lipschitz constant of the SSCL feature map and SNE with uniformity constraint; and we identify that the popular cosine similarity can be harmful for OOD generalization.

• Motivated by the SNE perspective, we propose several modifications to existing SSCL methods and demonstrate practical improvements. Besides a re-weighting scheme, we advocate dropping the spherical constraint for improved OOD performance and a t-SNE style matching for improved separation.
Through comprehensive numerical experiments, we show that the modified t-SimCLR outperforms the baseline with 90% fewer feature dimensions on CIFAR-10, and that t-MoCo-v2 pretrained on ImageNet significantly outperforms its baseline on various domain transfer and OOD tasks.

2. PRELIMINARY AND RELATED WORK

Notations. For a function f : Ω → R, let ‖f‖_∞ = sup_{x∈Ω} |f(x)| and ‖f‖_p = (∫_Ω |f(x)|^p dx)^{1/p}. For a vector x, ‖x‖_p denotes its p-norm, for 1 ≤ p ≤ ∞. P(A) is the probability of event A. For a random variable z, we use P_z and p_z to denote its probability distribution and density, respectively. Denote the Gaussian distribution by N(µ, Σ) and let I_d be the d×d identity matrix. Let the dataset be D_n = {x_1, ..., x_n} ⊂ R^d, where each x_i independently follows distribution P_x. The goal of unsupervised representation learning is to find informative low-dimensional features z_1, ..., z_n ∈ R^{d_z} of D_n, where d_z is usually much smaller than d. We use f(x) as the default notation for the feature mapping from R^d to R^{d_z}, i.e., z_i = f(x_i).

Stochastic neighbor embedding. SNE (Hinton & Roweis, 2002) is a powerful representation learning framework designed for visualizing high-dimensional data in low dimensions by preserving neighboring information. The training process can be conceptually decomposed into two steps: (1) calculate the pairwise similarity matrix P ∈ R^{n×n} for D_n; (2) optimize features z_1, ..., z_n such that their pairwise similarity matrix Q ∈ R^{n×n} matches P. Under these general guidelines lie plentiful details. In Hinton & Roweis (2002), the pairwise similarity is modeled as the conditional probability of x_j being the neighbor of x_i, specified by a Gaussian distribution centered at x_i, i.e., for i ≠ j,

P_{j|i} = exp(-‖x_i - x_j‖_2^2 / 2σ_i^2) / Σ_{k≠i} exp(-‖x_i - x_k‖_2^2 / 2σ_i^2),   (2.1)

where σ_i^2 is the variance of the Gaussian centered at x_i. Similar conditional probabilities Q_{j|i} can be defined on the feature space. When matching Q to P, the chosen measure of discrepancy is the KL-divergence between the two conditional distributions. t-SNE (Van der Maaten & Hinton, 2008) modified the pairwise similarity by considering joint rather than conditional distributions, and utilizes the t-distribution instead of the Gaussian for the feature-space modeling.
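To make the two-step view concrete, the conditional similarities in (2.1) can be computed directly. Below is a minimal NumPy sketch (the function name `sne_conditional_P` is ours, not from any library), with a per-point bandwidth vector standing in for the σ_i's:

```python
import numpy as np

def sne_conditional_P(X, sigma):
    """Conditional SNE similarities P_{j|i} as in (2.1)."""
    # Pairwise squared l2 distances ||x_i - x_j||^2.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Unnormalized Gaussian affinities with per-point bandwidth sigma_i.
    W = np.exp(-D / (2.0 * sigma[:, None] ** 2))
    # Exclude the diagonal: the sum in (2.1) runs over k != i.
    np.fill_diagonal(W, 0.0)
    # Row-normalize: P_{j|i} = W_ij / sum_{k != i} W_ik.
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
P = sne_conditional_P(X, sigma=np.ones(6))
print(P.sum(axis=1))  # each row is a conditional distribution
```

Each row of `P` sums to one, and closer points receive higher conditional probability, which is exactly the neighboring information that the feature-space `Q` is trained to match.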
It is worth noting that SNE belongs to a large class of methods called manifold learning (Li et al., 2022). In this work, we specifically consider SNE. If no confusion arises, we use SNE to denote both the specific method of Hinton & Roweis (2002) and this type of methods in general, interchangeably.

Self-supervised contrastive learning. The key part of SSCL is the construction of positive pairs, usually referred to as different views of the same sample. For each x_i in the training data, denote its two augmented views by x_i' and x_i''. Let D_n' = {x_1', ..., x_n'}, D_n'' = {x_1'', ..., x_n''} and define

l(x_i', x_i'') = -log [ exp(sim(f(x_i'), f(x_i''))/τ) / Σ_{x ∈ D_n' ∪ D_n'' \ {x_i'}} exp(sim(f(x_i'), f(x))/τ) ],

where sim(z_1, z_2) = ⟨z_1/‖z_1‖_2, z_2/‖z_2‖_2⟩ denotes the cosine similarity and τ is a temperature parameter. The training objective of the popular SimCLR (Chen et al., 2020a) can be written as

L_InfoNCE := (1/2n) Σ_{i=1}^n ( l(x_i', x_i'') + l(x_i'', x_i') ).

Recently, various algorithms have been proposed to improve upon this contrastive objective. To address the need for a large batch size, MoCo (He et al., 2020a; Chen et al., 2020b) utilizes a moving-averaged encoder and a dynamic memory bank to store negative representations, making it more device-friendly. Grill et al. (2020); Chen & He (2021); Zbontar et al. (2021); Chen et al. (2021) radically discard negative samples in SSCL but still achieve satisfactory transfer performance. Another line of works (Caron et al., 2020; Li et al., 2021; Liu et al., 2022) mines the hierarchy information in data to derive more semantically compact representations. Radford et al. (2021); Yao et al. (2021) even extend contrastive methods to multi-modality data and achieve impressive zero-shot classification results.

Theoretical understanding of SSCL. In contrast to the empirical success, theoretical understanding of SSCL is still limited.
While most theoretical works (Arora et al., 2019; Tosh et al., 2020; HaoChen et al., 2021; 2022; Wang et al., 2022; Wen & Li, 2021; Wei et al., 2020; Huang et al., 2021; Ji et al., 2021; Ma et al., 2023) focus on its generalization ability on downstream tasks, some works study the InfoNCE loss specifically. One line of works (Oord et al., 2018; Bachman et al., 2019; Hjelm et al., 2018; Tian et al., 2019; 2020) understands the InfoNCE loss from the mutual information perspective, showing that the negative InfoNCE is a lower bound of the mutual information between positive samples. Other works (Wang & Isola, 2020; Huang et al., 2021; Jing et al., 2021) take the perspective of the geometry of the embedding space, showing that InfoNCE can be divided into two parts: one controls alignment and the other prevents representation collapse. In this paper, we study SSCL from the SNE perspective, which, to the best of the authors' knowledge, has not been discussed in the existing literature. The closest work to ours is Balestriero & LeCun (2022), which proposed a unifying framework under the helm of spectral manifold learning. In comparison, our work focuses specifically on the connection between SSCL and SNE.

3. SNE PERSPECTIVE OF SSCL

A closer look at the training objectives of SNE and SimCLR reveals great resemblance: SimCLR can be seen as a special SNE model. To see this, denote D_2n = D_n' ∪ D_n'' as the augmented dataset with indexing x̃_{2i-1} = x_i' and x̃_{2i} = x_i''. Change the l_2 distance to the negative cosine similarity and let σ_i^2 ≡ τ. Adopting the same conditional probability formulation as in (2.1) yields, for i ≠ j,

Q_{j|i} = exp(sim(f(x̃_i), f(x̃_j))/τ) / Σ_{k≠i} exp(sim(f(x̃_i), f(x̃_k))/τ).   (3.1)

By taking

P_{j|i} = 1 if x̃_i and x̃_j are positive pairs, and 0 otherwise,   (3.2)

the SNE objective (2.2) can be written as

Σ_{i=1}^{2n} Σ_{j=1}^{2n} P_{j|i} log(P_{j|i}/Q_{j|i}) = Σ_{k=1}^{n} [ -log Q_{2k-1|2k} - log Q_{2k|2k-1} ],

which reduces to the SimCLR objective L_InfoNCE, up to a constant scaling term depending only on n.

Now that we have established the correspondence between SNE and SimCLR, it is clear that the feature learning process of SSCL also follows the two steps of SNE. (S1) The positive pair construction specifies the similarity matrix P. (S2) The training process then matches Q to P by minimizing some divergence between the two, specified by the training objective, e.g., the KL divergence in SimCLR.

The main difference between SNE and SSCL lies in the first step: the P in SNE is usually densely filled by l_p distances, ignoring the semantic information within rich data like images and texts. In contrast, SSCL omits all traditional distances in R^d and specifies semantic similarity only through data augmentations, so the resulting P is sparsely filled, with nonzero entries only at positive pairs as in (3.2). For structurally rich data such as images or text, the semantic information is invariant to a wide range of transformations. Humans' prior knowledge of such invariance guides the construction of positive pairs in SSCL, which is then learned by the feature mapping.

Remark 3.1 (SNE vs SSCL). We would like to clarify the main difference between SNE and SSCL that we focus on in this work.
Although standard SNE (Hinton & Roweis, 2002) is non-parametric, without explicit feature maps, and is optimized over the whole dataset, these are not the defining properties of SNE. SNE can also utilize explicit feature maps and mini-batch training (Van der Maaten, 2009). On the other hand, SSCL can also benefit from larger/full batches (Chen et al., 2020a) and can also be modified to directly optimize the features z_i. In this work, we omit these subtleties and focus on the (S1) perspective, which we view as the most significant difference between SNE and SSCL.
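The reduction of the sparse-P SNE objective to L_InfoNCE can be verified numerically. Below is a self-contained NumPy sketch (our own construction; positive pairs are indexed (i, i+n) rather than (2i-1, 2i), which is equivalent up to relabeling):

```python
import numpy as np

def q_conditional(Z, tau):
    """Feature-space conditional similarities as in (3.1)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    W = np.exp(Zn @ Zn.T / tau)       # exp(cosine similarity / tau)
    np.fill_diagonal(W, 0.0)          # sum over k != i
    return W / W.sum(axis=1, keepdims=True)

def l_pair(Z, i, j, tau):
    """SimCLR per-anchor loss l(., .) computed from its own definition."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    s = Zn @ Zn[i] / tau
    return -np.log(np.exp(s[j]) / np.exp(np.delete(s, i)).sum())

rng = np.random.default_rng(1)
n, dz, tau = 5, 4, 0.5
Z = rng.normal(size=(2 * n, dz))      # features of D_2n; (i, i+n) are positive pairs

Q = q_conditional(Z, tau)
# With the 0/1 matrix P of (3.2), the SNE KL objective collapses to the sum of
# -log Q over positive pairs (the P log P terms vanish since log 1 = 0).
sne_loss = -sum(np.log(Q[i, i + n]) + np.log(Q[i + n, i]) for i in range(n))
# SimCLR InfoNCE computed independently; should equal sne_loss / (2n).
L_infonce = sum(l_pair(Z, i, i + n, tau) + l_pair(Z, i + n, i, tau)
                for i in range(n)) / (2 * n)
```

The two quantities agree up to the constant factor 2n, matching the claim in Section 3.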

3.1. ANALYSIS

In this section, to showcase the utility of the SNE perspective, we demonstrate how the feature learning process of SSCL methods, e.g., SimCLR, can become more intuitive and transparent. Specifically, we re-derive the alignment and uniformity principle (Wang & Isola, 2020), and provide novel analysis on domain-agnostic augmentations, the implicit bias and the robustness of learned features. To aid the illustration, we devise toy examples with simulated Gaussian mixture data.

Gaussian mixture setting. Let the data follow a d-dimensional Gaussian mixture distribution with m components, P_x = (1/m) Σ_{i=1}^m N(µ_i, σ^2 I_d). The special case with d = 2, m = 5, σ = 0.1 is illustrated in Figure 1(a) with 250 independent samples. To apply contrastive methods, we construct positive pairs by direct sampling: if x is from the first component, we sample another x' ~ N(µ_1, σ^2 I_d) independently as its alternative view for contrast. The negative samples are the same as in standard SimCLR training.
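The toy construction above can be sketched in a few lines of NumPy (function and variable names are ours; the five means on a circle mirror the 2-d illustration but are an assumption):

```python
import numpy as np

def gmm_positive_pairs(n, means, sigma, rng):
    """Anchors and positive views drawn independently from the same component."""
    m, d = means.shape
    comp = rng.integers(m, size=n)                        # mixture assignment
    x = means[comp] + sigma * rng.normal(size=(n, d))     # anchor
    xp = means[comp] + sigma * rng.normal(size=(n, d))    # independent second view
    return x, xp, comp

rng = np.random.default_rng(0)
# Five well-separated means placed on the unit circle, as in the d=2, m=5 setting.
angles = 2 * np.pi * np.arange(5) / 5
means = np.stack([np.cos(angles), np.sin(angles)], axis=1)
x, xp, comp = gmm_positive_pairs(250, means, 0.1, rng)
```

Positive pairs share a component and are therefore close, while random (negative) pairings are typically far apart, which is the structure that the sparse P in (3.2) encodes.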

3.1.1. DOMAIN-AGNOSTIC DATA AUGMENTATION

Now that we have established in (S1) that the input-space pairwise similarity is specified by the data augmentation, a natural question is what the corresponding induced distances are. In this section, we investigate this problem for domain-agnostic data augmentations. The quality of data augmentation has great impact on the performance of SSCL methods, as it reflects people's prior knowledge of the data. However, when facing new data without any domain knowledge, we have to rely on domain-agnostic data augmentations, e.g., adding random noise (Verma et al., 2021), for contrast.

We first consider general random noise augmentation, i.e., for any x ∈ R^d, let x' = x + δ, where δ follows some distribution with density φ(·). Then, for any x_i, the probability density of having t ∈ R^d as its augmented point can be characterized as

P_{t|x_i} = P(x_i and x_i' = t form a positive pair | x_i) = φ(t - x_i).

We have the following proposition on the Gaussian-induced distance.

Proposition 3.2 (Gaussian noise injection). If the noise distribution is isotropic Gaussian with mean zero, the induced distance is equivalent to the l_2 distance in R^d, up to a monotone transformation.

Another popular noise injection method is mixup (Zhang et al., 2017), where the augmented data are convex combinations of the training data. For each x_i, a positive pair can be constructed from another x_j such that x_i' = x_i + λ(x_j - x_i), where λ ∈ (0, 1) is a hyperparameter usually modeled with a Beta distribution. For independent x_1, x_2 ~ P_x, denote the convoluted density of λ(x_1 - x_2) by p_λ(x), which is symmetric around 0. Then, when employing mixup for positive pairs in SSCL, the induced similarity can be written as P_{x_1,x_2} = P_{x_2,x_1} = p_λ(x_1 - x_2).

Gaussian vs. mixup. Verma et al.
(2021) proposed to use mixup when domain-specific information is unattainable and provided supporting analysis of its advantage over isotropic Gaussian noise from the viewpoint of classification generalization error. Through the (S1) perspective, we can intuitively explain why the data-dependent mixup noise can be preferable, from the perspective of the "curse of dimensionality". Consider the d-dimensional Gaussian mixture setting with m < d separated components. Notice that µ_1, ..., µ_m can take up at most an (m - 1)-dimensional linear subspace of R^d. Denote the space spanned by the µ_i's as S_µ. Since the Gaussian distribution is light-tailed, the majority of samples will be close to S_µ. Hence, the majority of the mass of the convoluted density p_λ(x) will also be supported near S_µ, and so will the corresponding P_{x_1,x_2}. Thus, the induced distance from mixup will omit irrelevant variations in the complement of S_µ and focus on the low-dimensional subspace S_µ where the µ_i's actually differ. This effectively reduces the dimension dependence from d to m - 1. In comparison, isotropic Gaussian noise induces the l_2 distance for positive pairs with support on all of R^d, which is much less efficient, especially when m ≪ d. Since it is well known that the performance of regression or classification models is strongly influenced by the intrinsic dimension of the input space (Hamm & Steinwart, 2021), keeping the data in a low-dimensional space is preferable.
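The geometric intuition above can be checked directly: when the data lie exactly on a low-dimensional subspace, the mixup perturbation λ(x_j - x_i) stays on that subspace, while isotropic Gaussian noise does not. A small NumPy sketch (an idealized setting of ours, with no within-subspace noise):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 2, 200
B = np.linalg.qr(rng.normal(size=(d, k)))[0]   # orthonormal basis of a k-dim subspace S
X = rng.normal(size=(n, k)) @ B.T              # data supported exactly on S

lam = rng.beta(0.5, 0.5, size=n)[:, None]      # mixup coefficients lambda ~ Beta
j = rng.permutation(n)
mixup_noise = lam * (X[j] - X)                 # mixup perturbation lambda * (x_j - x_i)
gauss_noise = 0.5 * rng.normal(size=(n, d))    # isotropic Gaussian perturbation

def off_subspace_norm(V):
    """Norm of the component of the rows of V orthogonal to S."""
    return np.linalg.norm(V - (V @ B) @ B.T)

print(off_subspace_norm(mixup_noise))   # essentially zero: mixup stays on S
print(off_subspace_norm(gauss_noise))   # large: isotropic noise leaves S
```

With light-tailed within-subspace noise added, the contrast is no longer exact but the same qualitative gap remains, matching the dimension-reduction argument in the text.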

3.1.2. ALIGNMENT AND UNIFORMITY

Characterizing the learned features of SSCL is of critical importance. Wang & Isola (2020) proposed alignment and uniformity as principles for SimCLR-type contrastive learning methods. Such results can be intuitively understood through the perspective of (S1) and (S2). Consider the common case where the feature space is the (d_z - 1)-sphere. First, (3.2) indicates that only similarities (distances) between positive pairs are non-zero (finite), while all other pairwise similarities (distances) are zero (infinite). Preserving (3.2) requires the features of positive pairs to align (cosine similarity tending to 1) and those of negative pairs to be as distant as possible. In the extreme case where positive pairs match exactly, i.e., f(x_i') = f(x_i'') for all i = 1, ..., n, we call it perfect alignment. If perfect alignment is achieved and the features are constrained to the unit sphere, matching (3.2) amounts to pushing n points on the feature space as far apart as possible. Maximally separating n points on a sphere has been studied in geometry, known as the Tammes problem (Tammes, 1930; Erber & Hockney, 1991; Melissen, 1998). We say perfect uniformity is achieved if all the points are maximally separated on the sphere. There are some simple cases of the Tammes problem. If d_z = 2, perfect uniformity is achieved when the mapped points form a regular polygon. If d_z ≥ n - 1, the solution is given by the vertices of a regular (n - 1)-simplex inscribed in the unit sphere. The cosine similarity between any two vertices is -1/(n - 1), and in this case L_InfoNCE attains its lower bound. As n → ∞, the point distribution converges weakly to the uniform distribution. As can be seen in Figure 1(a, b), perfect alignment and perfect uniformity are almost achieved by standard SimCLR in the Gaussian mixture setting.
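The simplex case is easy to verify numerically. The sketch below (our construction: centered standard basis vectors, then normalized) builds n maximally separated unit vectors and checks the pairwise cosine similarity of -1/(n-1):

```python
import numpy as np

def simplex_vertices(n):
    """n unit vectors forming a regular (n-1)-simplex centered at the origin.

    Rows live in an (n-1)-dimensional subspace of R^n, so they can be embedded
    in any feature space with d_z >= n - 1.
    """
    V = np.eye(n) - 1.0 / n                              # center the standard basis
    return V / np.linalg.norm(V, axis=1, keepdims=True)  # project to the unit sphere

n = 6
V = simplex_vertices(n)
G = V @ V.T   # Gram matrix: 1 on the diagonal, -1/(n-1) off the diagonal
```

This is the configuration at which perfect uniformity (and hence the lower bound of L_InfoNCE) is attained when d_z ≥ n - 1.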
As we will demonstrate in Section 3.1.4, the spherical feature space can be bad for OOD generalization; adopting the Euclidean space changes the statement of the uniformity property and can also be analyzed from the SNE perspective. Details can be found in Appendix A.5.

3.1.3. IMPLICIT BIAS

Existing theoretical results on SSCL justify its empirical success in classification. However, there is more to the learned features than separating different classes, and many phenomena are left unexplained. Taking the popular SimCLR (Chen et al., 2020a) on CIFAR-10 as an example, we can consistently observe that the feature similarities within animals (bird, cat, deer, dog, frog, horse) and within objects (airplane, automobile, ship, truck) are significantly higher than those between animals and objects. This can be viewed as an implicit bias towards preserving semantic information, which might be surprising since no label information supervises the training process. However, the existing literature on this implicit bias is scarce. As advocated in Saunshi et al. (2022), ignoring inductive biases cannot adequately explain the success of contrastive learning. In this section, we provide a simple explanation from the perspective of SNE.

For a more concrete illustration, consider training SimCLR in the Gaussian mixture setting with d = 1, d_z = 2, m = 4, µ_i = i, and σ = 0.1. Denote the 4 components in ascending order by A, B, C, D. Perfect alignment and uniformity imply that their feature maps (a, b, c, d) on the unit circle should be vertices of an inscribed square. What is left unsaid is their relative order. Whether clockwise or counter-clockwise from a, regardless of the initialization, we consistently observe SimCLR to produce the order a → b → c → d.

Remark 3.3 (Relative ordering and neighbor-preserving). The order-preserving property showcased with d = 1 is mainly for illustration, as in one dimension the neighboring information simplifies to the order, which is much easier to understand. The results remain the same in high dimensions as long as the clusters are well separated with an obvious ordering. For instance, some relative orders in Figure 1(a, b) are also stable, e.g., the neighbors of blue will consistently be purple and yellow.
With great resemblance to SNE, SSCL methods also exhibit the neighbor-preserving property, and we identify it as an implicit bias. Such implicit bias can be universal in SSCL, and the phenomenon in Figure A.3 is also a manifestation. In deep learning, implicit bias is usually characterized either by closeness to the initialization (Moroshko et al., 2020; Azulay et al., 2021) or by minimizing certain complexity (Razin & Cohen, 2020; Zhang et al., 2021). In the case of SimCLR, we hypothesize the implicit bias to be minimizing the expected Lipschitz constant, which has deep connections to SNE with uniformity constraint. For a feature map f onto the unit sphere, define

C(f) = E_{x,x'} [ ‖f(x) - f(x')‖_2 / ‖x - x'‖_2 ],   (3.3)

where x and x' are independent samples from the data distribution.

Definition 3.4 (SNE with uniformity constraint). Assume data x_1, ..., x_n ∈ R^d. If the corresponding SNE features z_1, ..., z_n ∈ R^{d_z} are constrained to be maximally separated n points on the (d_z - 1)-sphere, we call this problem SNE with uniformity constraint.

The key of SNE is matching the pairwise similarity matrix Q to P. When solving SNE with uniformity constraint, the only thing to be optimized is the pairwise correspondence, i.e., the ordering of the mapping. The following theorem links the neighbor-preserving property to C(f).

Theorem 3.5. Let x_1, ..., x_n ∈ R^d be such that ‖x_i - x_j‖_2 > 0 for any i ≠ j, and let z_1, ..., z_n ∈ R^{d_z} be maximally separated n points on the (d_z - 1)-sphere. Denote P = (p_ij)_{n×n} and Q = (q_ij)_{n×n} as the corresponding pairwise similarity matrices of the x_i's and z_i's, respectively. Let π denote a permutation on {1, ..., n} and denote the set of all such permutations by T. Let Q_π be the π-permuted matrix Q, and define C_1(P, Q_π) = Σ_{i≠j} q_{π(i)π(j)} / p_ij and π* = argmin_{π∈T} C_1(P, Q_π). Then, π* also minimizes ‖P̃ - Q_π‖_F, where ‖·‖_F is the Frobenius norm and P̃ = (p̃_ij)_{n×n} is a (monotonically) transformed similarity matrix with p̃_ij = -1/p_ij.
Theorem 3.5 showcases the relationship between minimizing C(f) and the structure-preserving property by considering a special SNE problem, where the pairwise similarity is not modeled by a Gaussian as standard. Although the similarity q_ij = -‖f(x_i) - f(x_j)‖_2 is unorthodox, it is reasonable since the larger the distance, the smaller the similarity. We have the following corollary explaining the neighbor-preserving property of SSCL and the implicit bias associated with minimizing the complexity C(f).

Corollary 3.6 (Implicit bias of SSCL). When an SSCL model achieves perfect alignment and perfect uniformity, if the complexity C(f) is minimized, the resulting feature map preserves pairwise distances in the input space, resembling SNE with uniformity constraint.

Corollary 3.6 links the implicit bias of SSCL to the SNE optimization with uniformity constraint. In the case of perfect alignment and perfect uniformity, SSCL can be seen as a special SNE problem where the features z_1, ..., z_n must be maximally separated on the unit sphere. Recall the 1-dimensional Gaussian case: there are in total 3! = 6 different orderings for the 4 cluster means, among which a → b → c → d gives the lowest SNE loss. When the alignment or uniformity is not perfect, the resulting feature mapping can still be characterized via SNE, with the uniformity constraint relaxed into a form of regularization. In our numerical experiments on the CIFAR-10 data, we observe C(f) to be monotonically decreasing during the training process, supporting our hypothesis. More details can be found in Appendix A.3. Corollary 3.6 also sheds light on the implicit semantic-information-preserving phenomenon shown in Figure A.3, as in the input space, images of dogs should be closer to images of cats than to airplanes.
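The complexity C(f) in (3.3) can be estimated by Monte Carlo over random sample pairs. A minimal sketch (the estimator and its name are ours):

```python
import numpy as np

def expected_lipschitz(f, X, rng, n_pairs=2000):
    """Monte-Carlo estimate of C(f) = E ||f(x) - f(x')||_2 / ||x - x'||_2."""
    i = rng.integers(len(X), size=n_pairs)
    j = rng.integers(len(X), size=n_pairs)
    keep = i != j                      # the ratio is undefined for identical points
    i, j = i[keep], j[keep]
    num = np.linalg.norm(f(X[i]) - f(X[j]), axis=1)
    den = np.linalg.norm(X[i] - X[j], axis=1)
    return (num / den).mean()
```

As a sanity check, for the linear map f(x) = c x the ratio is exactly c for every pair, so the estimator returns c; tracking this quantity over training iterations is how the monotone decrease reported in Appendix A.3 can be measured.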

3.1.4. TARGETING OOD: EUCLIDEAN VS SPHERICAL

Almost all SSCL methods normalize features to the unit sphere, and the similarity on the feature space is often the cosine similarity. In comparison, standard SNE methods operate freely in Euclidean space. In this section, we show that the normalization can hinder structure preservation, and that there is a fundamental trade-off between in-distribution and out-of-distribution generalization. Consider the 2-dimensional Gaussian mixture setting illustrated in Figure 1(a). Notice that as long as the mixture components are well separated, the learned feature mapping on the sphere will always be the pentagon shape, regardless of the relative locations of the clusters. This is a result of the uniformity property derived under the spherical constraint. Distant clusters in the input space will be pulled closer while close clusters will be pushed further apart, which results in the trade-off between in-distribution and out-of-distribution generalization. On one hand, close clusters being more separated in the feature space is potentially beneficial for in-distribution classification. On the other hand, the spherical constraint adds to the complexity of the feature mapping, potentially hurting robustness. In Euclidean space, pushing negative samples apart (as distant as possible) is much easier, since the feature vectors can diverge towards infinity and potentially preserve more structural information. To verify this intuition, we relax the spherical constraint in the Gaussian mixture setting and change the cosine similarity in SimCLR to the negative l_2 distance in Euclidean space. The learned features are shown in Figure 1(c). Compared to Figure 1(b), we obtain the extra information that the purple cluster is far away from the others. If we introduce a small mean shift to the data, moving the distribution along each dimension by 1, the resulting feature maps differ significantly in robustness, as illustrated in Figure 1.

4. IMPROVING SSCL BY SNE

The proposed SNE perspective (S1,S2) can inspire various modifications to existing SSCL methods. In this section, we choose SimCLR as our baseline and investigate three straightforward modifications. For empirical evaluation, we report the test classification accuracy of nearest neighbor classifiers on both simulated data and real datasets. Experiment details can be found in Appendix C.

4.1. WEIGHTED POSITIVE PAIRS

In practice, positive pairs are constructed from anchors (training data) by i.i.d. data augmentations, e.g., random resized crop, random horizontal flip, color jitter, etc. Take random crop as an example: one pair may come from 30% crops and another from 80% crops. Their similarities should not be treated as equal, as is done in typical SSCL methods. Incorporating the disparity of the data augmentation process is straightforward from the SNE perspective, where the InfoNCE loss can be naturally modified as

(1/2n) Σ_{i=1}^n p_ii · ( l(x_i', x_i'') + l(x_i'', x_i') ).

The weight p_ii in P can be specified manually to reflect prior knowledge. To test the effect of this modification, we conduct numerical experiments on CIFAR-10 using standard SimCLR. The weighting scheme is based on the Intersection over Union (IoU) of the random resized crops: for each positive pair, let p_ii ∝ exp(IoU(x_i', x_i'')/τ), where τ > 0 is a hyperparameter (temperature) controlling the strength of the weighting scheme, i.e., the bigger τ, the closer to the unweighted case. The CIFAR-10 test performance vs. τ is shown in Figure 2(a). The baseline accuracy is 80.7% and can be significantly improved to 82.1% when choosing τ = 1.
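The weighting scheme can be sketched as follows. Both helpers are hypothetical illustrations of ours (crops are given as (x1, y1, x2, y2) boxes; the normalization to mean one is our choice, made so that the overall loss scale is preserved):

```python
import numpy as np

def crop_iou(a, b):
    """Intersection over Union of two axis-aligned crop boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def pair_weights(ious, tau=1.0):
    """Per-pair weights p_ii ~ exp(IoU / tau), normalized to mean 1."""
    w = np.exp(np.asarray(ious, dtype=float) / tau)
    return w / w.mean()
```

Pairs of heavily overlapping crops (high IoU, likely sharing semantics) get up-weighted, while pairs of nearly disjoint crops contribute less, directly implementing the non-uniform P advocated above.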

4.2. T-SIMCLR: t-SNE STYLE MATCHING

Most SSCL algorithms differ mainly in (S2), i.e., defining Q and matching it to P, where fruitful results from the SNE literature can be mirrored and applied. Having identified the advantage of modeling features in Euclidean space in Section 3.1.4, the most promising modification is to introduce t-SNE into SimCLR. Since we are learning low-dimensional features from high-dimensional data, preserving all pairwise similarities is impossible and the features tend to collapse. This is referred to as the "crowding problem" in Van der Maaten & Hinton (2008) (see Section 3.2 therein). t-SNE utilizes the heavy-tailed t-distribution, instead of the light-tailed Gaussian, to model Q and encourage separation in the feature space. Correspondingly, the training objective L_InfoNCE can be modified as

(1/n) Σ_{i=1}^n -log [ (1 + ‖f(x_i') - f(x_i'')‖_2^2 / (τ t_df))^{-(t_df+1)/2} / Σ_{1≤j≠k≤2n} (1 + ‖f(x̃_j) - f(x̃_k)‖_2^2 / (τ t_df))^{-(t_df+1)/2} ],   (4.1)

where t_df is the degrees of freedom of the t-distribution. Besides substituting the cosine similarity with the l_2 distance, the key modification is the modeling of the feature-space similarity Q, from Gaussian to t-distribution, as suggested by Van der Maaten & Hinton (2008) to avoid the crowding problem and accommodate the dimension deficiency of the feature space. We call the modified method t-SimCLR and expect it to work better, especially when the feature dimension is low, or in the OOD case (see also Mialon et al., 2020). On the other hand, standard SNE methods can also borrow existing techniques from SSCL to improve their performance on more complicated data, e.g., incorporating data augmentations instead of, or on top of, pre-defined distances. In this sense, by choosing the feature dimension to be 2, various SSCL methods can also be used as data visualization tools (Böhm et al., 2022; Damrich et al., 2022).
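The objective (4.1) can be sketched directly on feature batches. Below is a minimal NumPy version (our own sketch; positive pairs are indexed (i, i+n), and a training implementation would compute this on encoder outputs with gradients):

```python
import numpy as np

def t_simclr_loss(Z1, Z2, tau=1.0, t_df=1.0):
    """t-SimCLR objective (4.1) on feature batches Z1, Z2 (n x d_z each)."""
    n = Z1.shape[0]
    Z = np.concatenate([Z1, Z2], axis=0)                     # features of D_2n
    D2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)      # squared l2 distances
    # Heavy-tailed Student-t kernel replacing the Gaussian/cosine similarity.
    K = (1.0 + D2 / (tau * t_df)) ** (-(t_df + 1.0) / 2.0)
    np.fill_diagonal(K, 0.0)                                 # exclude j == k
    denom = K.sum()                                          # sum over all j != k
    num = np.array([K[i, i + n] for i in range(n)])          # positive-pair kernels
    return -np.log(num / denom).mean()
```

With t_df = 1 the kernel is the Cauchy kernel used by default in t-SNE; larger t_df interpolates back towards a Gaussian-like tail.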
Specifically, on CIFAR-10, standard t-SNE can barely reveal any clusters, while our t-SimCLR with d_z = 2 produces much more separation among the different labels. More details can be found in Appendix C.7.

Recall the domain-agnostic data augmentation process. For any x_i, the probability density of having t ∈ R^d as its augmented point can be characterized as P_{t|x_i} = P(x_i and x_i' = t form a positive pair | x_i) = φ(t - x_i). For an isotropic Gaussian density with mean 0 and covariance matrix σ^2 I, φ(t - x_i) ∝ exp(-‖t - x_i‖_2^2 / 2σ^2), which is monotone in the l_2 distance between t and x_i.

A.3 INVESTIGATIONS ON C(f ).

Figures A.4 and A.5 illustrate the evolution of different complexity measurements during the training process, under the Gaussian mixture setting and on CIFAR-10, respectively. In the Gaussian mixture setting, the feature extractor is a fully connected ReLU network. Besides C(f), we also evaluate the popular sum of squared weights. The observations for SimCLR are listed below.

• The expected Lipschitz constant C(f) is small at initialization. It first increases (until around 100 iterations) and then consistently decreases. This empirically supports the implicit bias towards minimizing C(f).

• C(f) and the sum of squared weights share very similar patterns.

• The SNE loss is non-increasing, as if we were doing stochastic neighbor embedding using the l_2 distance.

In the CIFAR-10 case, the feature extractor is ResNet-18 plus a fully connected projection layer. The output of ResNet-18 is usually called the representation (512-dimensional) and is utilized for downstream tasks, while the projection (128-dimensional) is used for training. Such a representation-projection setup is common in SSCL. Ma et al. (2023) aimed to decipher the projection head and revealed that the projection feature tends to be more uniformly distributed while the representation feature exhibits stronger alignment. Besides C(f), we also evaluate the l_2-norm of the representation.

• C(f) for the projection layer shares similar patterns as in the Gaussian mixture case: it first increases and then decreases. However, C(f) for the representation layer monotonically decreases.

• C(f) for the projection layer and the l_2-norm of the representation layer share almost identical patterns.

• Compared to SimCLR, both the calculated C(f) and the l_2-norm are much smaller for t-SimCLR.

In conclusion, on one hand, our empirical results demonstrate that the complexity C(f) of the feature extractor does decrease during training and seems to be implicitly minimized.
On the other hand, its trend is shared with other, more popularly used complexity measurements.

A.4 PROOF OF COROLLARY 3.6

In this section, we illustrate with rigor how the hypothesized implicit bias can give rise to the structure-preserving property of SSCL. Corollary 3.6 states that minimizing the (Lipschitz) complexity of the feature mapping will also result in the best match between $P$ and $Q$ (under permutation). To provide more theoretical insight, we present the following lemma in the simpler vector-matching case.

Lemma A.1. Let $0 < x_1 < \cdots < x_m$ and $0 < y_1 < \cdots < y_m$ be two real-valued sequences, normalized such that $\sum_{i=1}^m x_i^2 = \sum_{i=1}^m y_i^2 = 1$. Consider a permutation $\pi$ of $\{1, \cdots, m\}$ and denote the set of all such permutations as $\mathcal{T}$. Then
$$\operatorname*{argmin}_{\pi \in \mathcal{T}} \Big(-\sum_{i=1}^m y_{\pi(i)} x_i\Big) = \operatorname*{argmin}_{\pi \in \mathcal{T}} \sum_{i=1}^m (x_i - y_{\pi(i)})^2 := \pi^*,$$
where $\pi^*(i) = i$ for all $i = 1, \cdots, m$.

Proof. By the rearrangement inequality, we have $\sum_{i=1}^m y_{\pi(i)} x_i \le \sum_{i=1}^m y_i x_i$. Similarly,
$$\sum_{i=1}^m (x_i - y_{\pi(i)})^2 = \sum_{i=1}^m x_i^2 + \sum_{i=1}^m y_i^2 - 2\sum_{i=1}^m x_i y_{\pi(i)} \ge 2 - 2\sum_{i=1}^m x_i y_i.$$

Lemma A.1 gives a vector-version illustration of our Corollary 3.6, stating that minimizing the expected derivative of the mapping function $f$, i.e., $\sum_i \|f(x_i)\|/\|x_i\|$, leads to preserving the norm relationships between the input vectors and the output vectors.

Next, we provide the proof of Theorem 3.5.

Proof of Theorem 3.5. Straightforwardly, we can write
$$\|P - Q_\pi\|_F^2 = \sum_{i \ne j} (p_{ij} - q_{\pi(i)\pi(j)})^2 = \sum_{i \ne j} p_{ij}^2 + \sum_{i \ne j} q_{\pi(i)\pi(j)}^2 - 2\sum_{i \ne j} q_{\pi(i)\pi(j)} p_{ij} = 2C_1(P, Q_\pi) + \sum_{i \ne j} p_{ij}^2 + \sum_{i \ne j} q_{ij}^2.$$
Thus, minimizing $C_1(P, Q_\pi)$ also minimizes $\|P - Q_\pi\|_F$. Theorem 3.5 is a straightforward generalization of Lemma A.1.

Next, we provide the proof of Corollary 3.6, restated below.

Proof of Corollary 3.6. Recall the SimCLR loss $L_{\mathrm{InfoNCE}} = \frac{1}{2n}\sum_{i=1}^n (l(x_i, x_i') + l(x_i', x_i))$, where
$$l(x_i, x_i') = -\log \frac{\exp(\mathrm{sim}(f(x_i), f(x_i'))/\tau)}{\sum_{x \in D_n \cup D_n' \setminus \{x_i\}} \exp(\mathrm{sim}(f(x_i), f(x))/\tau)}.$$
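Lemma A.1 can be checked numerically by brute force over all permutations: the identity permutation simultaneously maximizes the cross term $\sum_i y_{\pi(i)} x_i$ and minimizes the squared mismatch. A small sketch (the sequence values are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m = 5
x = np.sort(rng.random(m)) + 0.1          # 0 < x_1 < ... < x_m
y = np.sort(rng.random(m)) + 0.1
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)  # sums of squares equal 1

perms = list(itertools.permutations(range(m)))
cross = [float(np.sum(y[list(p)] * x)) for p in perms]        # sum_i y_{pi(i)} x_i
sq = [float(np.sum((x - y[list(p)]) ** 2)) for p in perms]    # sum_i (x_i - y_{pi(i)})^2

# Rearrangement inequality: identity maximizes the cross term and
# therefore minimizes the squared mismatch.
assert perms[int(np.argmax(cross))] == tuple(range(m))
assert perms[int(np.argmin(sq))] == tuple(range(m))
```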
Without loss of generality, let $\tau = 1$. Notice that $l(x_i, x_i')$ is monotonically decreasing as $\mathrm{sim}(f(x_i), f(x_i'))$ increases, due to the monotonicity of the function $\frac{x}{x+c}$ with respect to $x > 0$ for any $c > 0$. Hence, in order for $L_{\mathrm{InfoNCE}}$ to be minimized, perfect alignment is required, i.e., $f(x_i) = f(x_i')$ for any $i = 1, \ldots, n$. With perfect alignment achieved, $L_{\mathrm{InfoNCE}}$ only concerns the pairwise similarities between the negative samples $f(x_i)$'s, and can be simplified as $L_{\mathrm{InfoNCE}} \ge L_{\mathrm{uniform}}$, where
$$L_{\mathrm{uniform}} = \frac{1}{n}\sum_{i=1}^n -\log \frac{e}{e + \sum_{j \ne i} \exp(\mathrm{sim}(f(x_i), f(x_j)))} \ge \log\Big(\frac{1}{n}\sum_{i=1}^n \Big(1 + \frac{1}{e}\sum_{j \ne i} \exp(\mathrm{sim}(f(x_i), f(x_j)))\Big)\Big) \ge \log\Big(1 + \frac{1}{n e}\sum_{1 \le i \ne j \le n} \exp(\mathrm{sim}(f(x_i), f(x_j)))\Big).$$
$L_{\mathrm{uniform}}$ can be minimized by mapping the $x_i$'s as far apart as possible, hence the connection to the Tammes problem and the uniformity principle.

With sufficient capacity of the feature mapping $f$, the SimCLR loss can be minimized to its (empirical) global minima. However, such $f$ is not unique, since $L_{\mathrm{InfoNCE}}$ is invariant to permutations of the mapping relationships from $x_i$ to $f(x_i)$. If $f_n^*$ further minimizes $C(f)$ on the sample level, i.e.,
$$f_n^* := \operatorname*{argmin}_f C_n(f) = \operatorname*{argmin}_f \sum_{1 \le i \ne j \le n} \frac{\|f(x_i) - f(x_j)\|_2}{\|x_i - x_j\|_2},$$
then $f_n^*$ also solves a type of SNE problem with uniformity constraint (3.4), as stated in Theorem 3.5. To see this, if we define $q_{ij} = -\|f(x_i) - f(x_j)\|_2$ and $p_{ij} = -\|x_i - x_j\|_2$, which is reasonable since the larger the distance, the smaller the similarity, we can directly apply the results in Theorem 3.5.

Remark A.2. As can be seen from Theorem 3.5 and the proof of Corollary 3.6, we showcase the relationship between minimizing $C(f)$ and the structure-preserving property by considering a special SNE problem, where the pairwise similarity is not modeled by a Gaussian as is standard, hence the word "resembling" in Corollary 3.6. Although $q_{ij} = -\|f(x_i) - f(x_j)\|_2$ is unorthodox, it is reasonable since the larger the distance, the smaller the similarity.
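The SimCLR loss above can be sketched for a toy batch. The following is a minimal, hypothetical full-batch implementation (cosine similarity, $\tau = 1$; the function name and batch layout are our choices), checking that aligned positive pairs yield a lower loss than random ones:

```python
import numpy as np

def info_nce(Z1, Z2, tau=1.0):
    # SimCLR InfoNCE loss for n positive pairs (z_i, z_i') with cosine similarity.
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    Z = np.vstack([Z1, Z2])                  # 2n embeddings
    n = len(Z1)
    S = Z @ Z.T / tau                        # pairwise similarities
    np.fill_diagonal(S, -np.inf)             # exclude self-pairs from the denominator
    losses = []
    for i in range(2 * n):
        pos = (i + n) % (2 * n)              # index of the positive partner
        losses.append(-(S[i, pos] - np.log(np.sum(np.exp(S[i])))))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))
# Perfectly aligned positives achieve a lower loss than random "positives".
assert info_nce(Z, Z) < info_nce(Z, rng.normal(size=(4, 8)))
```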
If we consider the SNE method as in Hinton & Roweis (2002), our proof does not go through directly and demands a more complicated analysis. However, our results remain valid in connecting the complexity of the feature map to the pairwise similarity matching. Our statement in Corollary 3.6 requires perfect alignment or perfect uniformity. When these assumptions are not perfectly met, we can still obtain insights into the resulting feature mapping. Alignment and uniformity (Wang & Isola, 2020) are not the whole story of contrastive learning, and our identified structure-preserving property, implicitly induced by complexity minimization, provides another angle on the learning process. From this perspective, contrastive learning can be thought of as a combination of alignment and SNE with a uniformity constraint. In Figure A.3, while obtaining approximate alignment and uniformity, the feature mapping also preserves the relative relationships of the clusters (labels).

A.5 ALIGNMENT AND UNIFORMITY OF T-SIMCLR

Due to the change of training objective, we may want to reevaluate the properties of the features learned by t-SimCLR. We will show that alignment still holds while uniformity changes (the features can spread to infinity). Let us consider a compact region $\Omega \subset \mathbb{R}^d$ and $x_i \in \Omega$. Let $t$ be the transformation such that the augmented data point $x_i' = t(x_i)$ is still in $\Omega$. Wang & Isola (2020) showed that the contrastive loss can be decomposed into the alignment loss and the uniformity loss. Zimmermann et al. (2021) further showed that the contrastive loss converges to the cross-entropy between latent distributions, where the underlying latent space is assumed to be uniform, and the positive pairs are specified to follow an exponential distribution. In this section, we show a parallel result, which states that at the population level, the t-SNE loss is the cross-entropy between two distributions of generating positive pairs.

Theorem A.3. Let $H(\cdot,\cdot)$ be the cross-entropy between distributions. Let $p(x)$ be the density of $x$, $p(\cdot|x)$ be the conditional density of generating a positive pair, and define
$$q_f(x'|x) = C_f(x)^{-1} \frac{p(x')}{1 + \|f(x) - f(x')\|_2^2}, \quad \text{with } C_f(x) = \int_\Omega \frac{p(x')}{1 + \|f(x) - f(x')\|_2^2} \, dx'.$$
Then, up to an additive constant that does not depend on $f$, we have
$$\mathbb{E}_{x \sim p(x)} H(p(\cdot|x), q_f(\cdot|x)) = L_a(f) + L_u(f), \quad (A.1)$$
which corresponds to the population-level t-SimCLR loss, where
$$L_a = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{x' \sim p(x'|x)} \log(1 + \|f(x) - f(x')\|_2^2), \qquad L_u = \mathbb{E}_{x \sim p(x)} \log \mathbb{E}_{\tilde{x} \sim p(\tilde{x})} (1 + \|f(x) - f(\tilde{x})\|_2^2)^{-1}.$$

Proof. Note that
$$H(p(\cdot|x), q_f(\cdot|x)) = -\int_\Omega p(x'|x) \log \frac{p(x')}{1 + \|f(x) - f(x')\|_2^2} \, dx' + \log C_f(x) = \int_\Omega p(x'|x) \log(1 + \|f(x) - f(x')\|_2^2) \, dx' - \int_\Omega p(x'|x) \log(p(x')) \, dx' + \log \mathbb{E}_{x' \sim p(x')} (1 + \|f(x) - f(x')\|_2^2)^{-1}.$$
Taking expectation with respect to $x$ leads to
$$\mathbb{E}_{x \sim p(x)} H(p(\cdot|x), q_f(\cdot|x)) = \mathbb{E}_{x \sim p(x)} \mathbb{E}_{x' \sim p(x'|x)} \log(1 + \|f(x) - f(x')\|_2^2) + \mathbb{E}_{x \sim p(x)} \log \mathbb{E}_{\tilde{x} \sim p(\tilde{x})} (1 + \|f(x) - f(\tilde{x})\|_2^2)^{-1} - C_p = L_a(f) + L_u(f) - C_p,$$
where $C_p = \int_\Omega \int_\Omega p(x) p(x'|x) \log(p(x')) \, dx' \, dx = \int_\Omega \int_\Omega p(x, x') \log(p(x')) \, dx' \, dx$ does not depend on $f$. Equivalently, in terms of the joint density,
$$\mathbb{E}_{x \sim p(x)} H(p(\cdot|x), q_f(\cdot|x)) = -\int_\Omega \int_\Omega p(x, x') \log \frac{p(x')}{1 + \|f(x) - f(x')\|_2^2} \, dx' \, dx + \int_\Omega p(x) \log \int_\Omega \frac{p(x')}{1 + \|f(x) - f(x')\|_2^2} \, dx' \, dx.$$

In Theorem A.3, $L_a$ is the alignment loss and $L_u$ is the uniformity loss. The decomposition is much more natural for t-SimCLR than for $L_{\mathrm{InfoNCE}}$, mainly due to the change from conditional to joint distributions when modeling the pairwise similarity. Furthermore, if the t-SimCLR loss is minimized, we must have $p(\cdot|x) = q_f(\cdot|x)$, provided $f$ has sufficient capacity. Note that if $p(\cdot|x) = q_f(\cdot|x)$, then $P_{j|i}$ and $Q_{j|i}$ are perfectly matched, which indicates that we obtain a perfect neighbor embedding. Theorem A.3 implies that the optimal feature mapping $f^*$ satisfies $p(\cdot|x) = q_{f^*}(\cdot|x)$, which further implies that for any $x \in \Omega$,
$$C_{f^*}(x)^{-1} \frac{p(x')}{1 + \|f^*(x) - f^*(x')\|_2^2} \propto C(x)^{-1} p(x'|x) \;\Leftrightarrow\; C_{f^*}(x)^{-1} \frac{1}{1 + \|f^*(x) - f^*(x')\|_2^2} \propto C(x)^{-1} \frac{p(x, x')}{p(x) p(x')}, \quad (A.2)$$
where $C(x) = \int p(x'|x) \, dx'$. Unlike the usual normalized SimCLR, t-SNE does not assume any special structure on $f$ (e.g., $\|f\|_2 = 1$), thus $\|f\|$ can go to infinity. Compared to the finite-sample t-SimCLR loss, the population version is trickier to analyze. This is because a given point $x'$ can be an augmented sample of some $x$ (with probability $p(x'|x)$), or a negative sample of $x$ (when we treat $x'$ as another sample point). This reflects the essential difficulty between the population and finite-sample settings in contrastive learning, not only for t-SimCLR.
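The alignment/uniformity decomposition above can be sketched at the batch level. This is our own hypothetical sketch (the function name and the $t_{df}$-scaled kernel are assumptions; the theorem's kernel corresponds to $t_{df} = 1$):

```python
import numpy as np

def t_simclr_loss(Z1, Z2, t_df=1.0):
    # Batch-level alignment + uniformity decomposition with the heavy-tailed
    # Student-t kernel (1 + ||z - z'||^2 / t_df)^{-1}.
    d_pos = np.sum((Z1 - Z2) ** 2, axis=1)
    L_a = float(np.mean(np.log1p(d_pos / t_df)))            # alignment term
    D = np.sum((Z1[:, None, :] - Z1[None, :, :]) ** 2, axis=-1)
    L_u = float(np.mean(np.log(np.mean(1.0 / (1.0 + D / t_df), axis=1))))  # uniformity
    return L_a + L_u, L_a, L_u

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
total, L_a, L_u = t_simclr_loss(Z, Z)   # perfectly aligned positives
# Alignment term vanishes under perfect alignment; uniformity term is negative
# (the average kernel value is at most 1).
assert L_a == 0.0 and L_u < 0.0
```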
For clustered data, (A.2) provides two important messages, provided that the augmentation is not too extreme and the augmented sample $x'$ stays in the same cluster as the original $x$. On one hand, when $x_1$ and $x_2$ belong to different clusters, the joint density $p(x = x_1, x' = x_2)$ will be very small, close to zero, which indicates that $\|f^*(x_1) - f^*(x_2)\|_2$ is very large, tending to infinity. On the other hand, for $x_1$ and $x_2$ belonging to the same cluster, $p(x = x_1, x' = x_2)$ will be relatively large. Hence, the features of the same cluster will stay close. Overall, we will observe a similar clustered structure in the feature space. This is confirmed in the Gaussian mixture setting in Figure 1(c), in which case the problem can be oversimplified as mapping 5 points in $\mathbb{R}^2$ onto the unit circle.

B CONNECTION TO DISTANCE BETWEEN DISTRIBUTIONS

Through the lens of stochastic neighbor embedding, the feature learning process of SSCL methods can be seen as minimizing certain "distances" between distributions in different dimensions. Ideally, the features should preserve the distributional information of the data. Since the data and the features do not lie in the same metric space, quantitatively measuring their distributional distance is difficult. Fortunately, there are existing tools we can utilize, specifically the Gromov-Wasserstein distance (Mémoli, 2011; Salmona et al., 2021). Let $\mathcal{X}, \mathcal{Z}$ be two Polish spaces, endowed respectively with probability measures $p_x$ and $p_z$. Given two measurable cost functions $c_x : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $c_z : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, and $D : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, the Gromov-Wasserstein distance can be defined as
$$GW_p(p_x, p_z \mid c_x, c_z) := \inf_{\pi \in \Pi(p_x, p_z)} \Big( \int_{\mathcal{X}^2 \times \mathcal{Z}^2} D(c_x(x, x'), c_z(z, z'))^p \, d\pi(x, z) \, d\pi(x', z') \Big)^{1/p},$$
where $\Pi(p_x, p_z)$ denotes all the joint distributions on $\mathcal{X} \times \mathcal{Z}$ such that the marginals are $p_x$ and $p_z$. Typically, $D(c_x, c_z)$ is chosen to be $|c_x - c_z|$ and $c_x(x, x')$ is usually chosen to be $\|x - x'\|_p$. The key idea of the Gromov-Wasserstein distance for circumventing the dimension mismatch is to compare pairwise distributions instead of marginal distributions, which is very similar to the SNE objective. Consider Monge's formulation of the optimal transportation problem and let $z = f(x)$. By choosing $c_z(z_i, z_j) = \log(Q_{j|i})$ with $Q$ specified as in (3.1), $c_x(x_i, x_j) = P_{j|i}$ with $P$ specified as in (3.2), and letting $D(c_x, c_z) = c_x(\log(c_x) - c_z)$, we have
$$GW_1(p_x, p_{f(x)}) \le \mathbb{E}_{x, x'}\big(D(c_x(x, x'), c_z(f(x), f(x')))\big),$$
where the right-hand side recovers the expected InfoNCE loss. Hence, the SNE perspective can also be viewed as minimizing the Gromov-Wasserstein distance between $p_z$ and $p_x$.
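The right-hand side of the bound above is a row-wise KL divergence $\sum_i \mathrm{KL}(P_i \| Q_i)$ between input- and feature-space conditionals. A small numeric sketch (function names are ours; we use Gaussian row-conditionals for both spaces, whereas the paper's $P$ uses augmentation-defined similarities):

```python
import numpy as np

def row_softmax(D, scale=1.0):
    # Row-conditional similarities P_{j|i} from negative scaled squared distances.
    S = -D / scale
    np.fill_diagonal(S, -np.inf)             # exclude j = i
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def sne_kl(X, Z):
    # sum_i KL(P_i || Q_i): the SNE-style matching cost related to GW_1 above.
    DX = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    DZ = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
    P, Q = row_softmax(DX), row_softmax(DZ)
    mask = ~np.eye(len(X), dtype=bool)
    return float(np.sum(P[mask] * (np.log(P[mask]) - np.log(Q[mask]))))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
assert sne_kl(X, X) < 1e-12                  # identical geometry: zero cost
assert sne_kl(X, rng.normal(size=(10, 2))) > 0.0
```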
It is worth noting that this interpretation applies only to contrastive learning, not to generation-based self-supervised learning methods such as Masked AutoEncoder (MAE) (He et al., 2021).

C EXPERIMENT DETAILS

C.1 CIFAR-10 SETTINGS

CIFAR-10 (Krizhevsky, 2009) is a color image dataset with 50,000 training samples and 10,000 test samples from 10 categories. We use ResNet-18 (He et al., 2016) as the feature extractor, and the other settings, such as the projection head, all follow the original settings of SimCLR (Chen et al., 2020a). To evaluate the quality of the features, we follow the kNN evaluation protocol (Wu et al., 2018), which computes the cosine similarities in the embedding space between the test image and its nearest neighbors and makes the prediction via weighted voting. We train each model with a batch size of 256 for 200 epochs for quicker evaluation. For t-SimCLR, unless specified otherwise, we grid search $t_{df}$ and $\tau$ over the ranges {1, 2, 5, 10} and {1, 2, 5, 10} respectively. The data augmentations include random resized crop (randomly crop a region and resize it to the original size), horizontal flip, and color jitter (randomly change the brightness, contrast, saturation and hue of an image). To illustrate the natural weighting scheme in Section 4.1, we considered random resized crop and specified the weights by the IoU (intersection over union) of the positive pair. In particular, two augmented images are created from an anchor image. Each augmentation crops a rectangular region of the image, denoted by $r_1, r_2$ respectively, and their IoU is defined as the area of the intersection $r_1 \cap r_2$ divided by the area of the union $r_1 \cup r_2$. The IoU is always between 0 and 1. In our experiment, we chose the default settings.
• The gain of extra dimensions in the OOD case vanishes later than that in the in-distribution case.
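The kNN evaluation protocol described above can be sketched as follows. This is a hypothetical minimal implementation (the function name and the default `k` and `temperature` values are our choices, not the paper's):

```python
import numpy as np

def knn_predict(train_z, train_y, test_z, k=5, temperature=0.5, num_classes=10):
    # Weighted-kNN classification in feature space: cosine similarity to the
    # k nearest training features, with exp(sim / T)-weighted voting.
    train_z = train_z / np.linalg.norm(train_z, axis=1, keepdims=True)
    test_z = test_z / np.linalg.norm(test_z, axis=1, keepdims=True)
    sims = test_z @ train_z.T                # cosine similarities
    preds = []
    for s in sims:
        idx = np.argsort(-s)[:k]             # k nearest neighbors
        w = np.exp(s[idx] / temperature)
        votes = np.zeros(num_classes)
        for weight, i in zip(w, idx):
            votes[train_y[i]] += weight
        preds.append(int(np.argmax(votes)))
    return np.array(preds)

train_z = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
train_y = np.array([0, 0, 0, 1, 1, 1])
test_z = np.array([[1.0, 0.05], [0.05, 1.0]])
assert knn_predict(train_z, train_y, test_z, k=3, num_classes=2).tolist() == [0, 1]
```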

Ablation of training epochs

• The advantage of t-SimCLR over SimCLR is very significant, with around 10% improvement when $d_z = 128$ using nearest neighbor classification, indicating that t-SimCLR produces better-separated clusters.

Relationship between $t_{df}$ and $d_z$. The larger the degrees of freedom $t_{df}$, the less heavy-tailed the t-distribution. As $d_z$ decreases, the crowding problem becomes more severe and, as recommended by Van der Maaten & Hinton (2008), a smaller $t_{df}$ tends to work better. We evaluate the sensitivity of $t_{df}$ (1, 5, 10) under different choices of $d_z$ (1, 2, 4, 8, 16, 32, 64, 128) on CIFAR-10 and report the results in Figure C.9. As can be seen, when $d_z$ is small (1, 2, 4, 8), $t_{df} = 1$ outperforms. Comparing $t_{df} = 5$ and $t_{df} = 10$, the two perform similarly when $d_z$ is large (16, 32, 64, 128), but the smaller $t_{df} = 5$ yields better accuracy when $d_z = 1, 2, 4$.

Tuning temperature vs. tuning $t_{df}$. As illustrated in Section 4.2, when the feature space dimension is low, the heavy-tailed t-distribution is a better choice than the Gaussian to alleviate the crowding problem. Even though tuning the temperature of $L_{\mathrm{InfoNCE}}$, i.e., making $\tau$ larger, can also make the distribution less concentrated ($\tau$ can be seen as the standard deviation), tuning the temperature and tuning $t_{df}$ are fundamentally different: $\tau$ rescales the similarities, offering a constant-level modification of the scheme, while $t_{df}$ changes how fast the similarity $Q_{i,j}$ decays as the distance between $z_i$ and $z_j$ increases. In our experiments with SimCLR vs. t-SimCLR on CIFAR-10, the temperature is tuned as a hyperparameter. The difference in $\tau$ can never make up for the gap between the baseline SimCLR and t-SimCLR. We found $\tau = 0.5$ to work better for the baseline SimCLR, while a larger $\tau$ works better with our t-SimCLR. We recommend $\tau = 5$ as the default choice.
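The tail behavior contrasted above can be made concrete with the two kernels. A hypothetical sketch (function names and parameterizations are ours; the t-kernel follows the t-SNE convention):

```python
import numpy as np

def t_kernel(d2, t_df=1.0):
    # Student-t similarity: polynomial tail whose order is set by t_df.
    return (1.0 + d2 / t_df) ** (-(t_df + 1.0) / 2.0)

def gauss_kernel(d2, tau=1.0):
    # Gaussian/exponential similarity: tau rescales, but the tail stays exponential.
    return np.exp(-d2 / tau)

# Heavy tail: at large distances the t-kernel dominates any Gaussian kernel,
# no matter how large the temperature scaling.
assert t_kernel(1e6, t_df=1.0) > gauss_kernel(1e6, tau=100.0)
```

This is why a large $\tau$ cannot reproduce the crowding relief provided by a small $t_{df}$: rescaling an exponential tail never makes it polynomial.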

C.4 IMAGENET PRE-TRAINING

To show the ability for large-scale domain transfer and OOD generalization, we conduct experiments on ImageNet pre-training based on MoCo-v2 with its official implementation. We follow most of their settings, e.g., data augmentation, 200 epochs of pre-training, and the optimization strategy. The loss is modified according to Section 4.2 and batch normalization is applied along every dimension. We grid search $t_{df}$ and $\tau$ over the ranges {2, 5, 10, 15} and {0.2, 2, 5, 10} respectively, and choose $t_{df} = 10$ and $\tau = 5$ as the optimal hyperparameters. We use this pre-trained model as the initialization for the domain transfer and OOD experiments.

C.5 DOMAIN TRANSFER

We compare MoCo-v2 pre-trained for 800 / 200 epochs and t-MoCo-v2 on Aircraft, Birdsnap, Caltech101, Cars, CIFAR10, CIFAR100, DTD, Pets, and SUN397 in Table C.3. We follow the transfer settings in Ericsson et al. (2021) to finetune the pre-trained models. For the datasets Birdsnap, Cars, CIFAR10, CIFAR100, DTD, and SUN397, we report the top-1 accuracy metric, while for Aircraft, Caltech101, and Pets, we report the mean per-class accuracy metric. We also follow Ericsson et al. (2021) to split each dataset into training, validation, and test sets. On each dataset, we perform a hyperparameter search as follows. (1) We choose the initial learning rate from a grid of 4 logarithmically spaced values between 1×10^-4 and 1×10^-1; (2) we choose the weight decay parameter from a grid of 4 logarithmically spaced values between 1×10^-6 and 1×10^-3, plus no weight decay; (3) the weight decay values are divided by the learning rate; (4) for each pair of learning rate and weight decay, we finetune the pre-trained model for 5000 steps by SGD with Nesterov momentum 0.9, a batch size of 64, and a cosine annealing learning rate schedule without restarts. As can be seen in Table C.3, our t-MoCo-v2 with 200 epochs even outperforms the baseline with 800 epochs on average.
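The grid described in steps (1)-(3) can be sketched as follows (a hypothetical reconstruction; variable names are ours):

```python
import numpy as np

# 4 log-spaced learning rates in [1e-4, 1e-1] and 4 log-spaced weight decays in
# [1e-6, 1e-3], plus the no-weight-decay option; weight decay divided by lr.
learning_rates = np.logspace(-4, -1, num=4)          # 1e-4, 1e-3, 1e-2, 1e-1
weight_decays = list(np.logspace(-6, -3, num=4)) + [0.0]

grid = [(lr, wd / lr if wd > 0 else 0.0)
        for lr in learning_rates for wd in weight_decays]
assert len(grid) == 20                               # 4 x 5 configurations
```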

C.6 OOD GENERALIZATION

To demonstrate the advantage of our modification, we also compare MoCo-v2 pre-trained for 800 / 200 epochs and t-MoCo-v2 on OOD generalization benchmarks: PACS (Li et al., 2017), VLCS (Fang et al., 2013), and Office-Home (Venkateswara et al., 2017). We follow the standard way to conduct the experiments.

C.7 SSCL AS A DATA VISUALIZATION TOOL

t-SNE (Van der Maaten & Hinton, 2008) and its variants are designed for data visualization. However, for more complicated data, such as color images, the results are not satisfactory. Using standard t-SNE, the 2D visualization of the 50K training images of CIFAR-10 (labels denoted as 0, 1, ..., 9) can be seen in Figure C.10, where different labels are hardly separated. The poor performance of t-SNE on CIFAR-10 can be traced back to the poor choice of distance on images, i.e., the $l_2$-norm. Inspired by the success of SSCL for natural images, t-SNE can potentially be improved by incorporating data augmentations. In light of our perspective (S1), t-SNE can take advantage of the distance specified with (3.2), and the resulting model is essentially our t-SimCLR with feature dimension 2. The visualization from t-SimCLR is shown in Figure C.11, which is much more separated (the nearest neighbor classification accuracy on the CIFAR-10 test data is 56.6%). By choosing the feature dimension to be 2, various SSCL methods can also be made into data visualization tools. In Figure C.12, we visualize the outcome from SimCLR (the nearest neighbor classification accuracy on the CIFAR-10 test data is 24.8%).



Footnotes:
1. All the contrastive losses are written in full batches for simplicity in this work, as we focus on analyzing the optimal solutions of SSCL methods rather than the optimization process.
2. Notice that in this case, the optimal feature mapping will contain little information about the data, mapping anchor samples to interchangeable points with identical pairwise distances; Figure A.3 illustrates the phenomenon. Details can be found in Appendix A.1.
3. In practice, various regularizations, e.g., weight decay, are employed and the resulting features will be bounded.
4. When evaluating by training linear classifiers for 100 epochs, the accuracy for SimCLR is 46.4% and that for t-SimCLR is 48.14% (averaged over 3 replications).
5. https://github.com/facebookresearch/moco



Figure 1: Gaussian mixture setting with 5 components. (a) illustration of data with 250 samples. (b) learned features by standard SimCLR with normalization (cosine similarity) to 1-sphere. (c) learned features by modified SimCLR without normalization (l 2 similarity). (d, e) feature mapping of the two methods in case of OOD mean shift. The linear classification accuracy is 48.4% in (d) and 100% in (e).

As shown in Figure A.4, both C(f) and the SNE loss are monotonically decreasing during training in the Gaussian mixture setting.

Comparing (d) vs. (e), the standard SimCLR is much less robust to OOD shifts and the resulting classification accuracy degrades to only 48.4%, while that of the modified SimCLR remains 100%. The same OOD advantage can also be verified in the CIFAR-10 to CIFAR-100 OOD generalization case (details in Appendix C.3, Figure C.8) and in large-scale real-world scenarios with MoCo (Chen et al., 2020b) as the baseline (details in Section 5).

Published as a conference paper at ICLR 2023

Figure 2: Nearest neighbor classification test accuracy on CIFAR-10 with ResNet-18 after 200 epochs pre-training. (a) "N/A" stands for the baseline SimCLR. The x-axis is the temperature for IoU weighting scheme. (b) Comparison between SimCLR and t-SimCLR with different feature dimensions.

Figure 2(b) shows the comparison between SimCLR and t-SimCLR on CIFAR-10 with different feature dimensions, where t-SimCLR has significant advantages in all cases, and the smaller the $d_z$, the larger the gap. Without decreasing the standard $d_z = 128$, t-SimCLR improves the baseline from 80.8% to 83.9%, and even beats it using only $d_z = 8$ with an accuracy of 81.7%.

Remark 4.1 (Degrees of freedom). Standard t-SNE utilizes the t-distribution with $t_{df} = 1$ to better accommodate the extreme $d_z = 2$ case. In practice, $t_{df}$ can vary, and as $d_z$ increases, a larger $t_{df}$ might be preferred. We recommend using $t_{df} = 5$ as the default choice. The performance of $t_{df}$ vs. $d_z$ can be found in Appendix C, as well as a discussion of the fundamental difference between $t_{df}$ and $\tau$.

Remark 4.2 (Training epochs). For the CIFAR-10 experiments, we reported the results of ResNet-18 after 200 training epochs, similar to the setting of Yeh et al. (2021). We also conducted 1000-epoch experiments and found that our modifications provide consistent improvements throughout the training process, not in terms of speeding up convergence, but in converging to better solutions. Details can be found in Appendix C.1 and Figure C.6.

Figure A.3: Cosine similarity heat map of learned features from SimCLR on CIFAR-10 dataset. The darker the color, the larger the similarity.

Figure A.3 plots the cosine similarity heat map of the features learned by SimCLR on the CIFAR-10 dataset. To calculate the similarity of class A (features denoted by $a_i$) to class B (features denoted by $b_i$), we first calculate the mean of the $b_i$ as $\bar{b}$. Then, we sum up $\sum_i \mathrm{sim}(a_i, \bar{b})$ and plot it with colors. Hence, the similarity matrix shown in Figure A.3 is not symmetric.
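The heat-map computation just described can be sketched as follows (the function name `class_similarity` is ours; note the asymmetry between the two arguments):

```python
import numpy as np

def class_similarity(feats_a, feats_b):
    # Similarity of class A to class B: sum of cosine similarities of each a_i
    # to the (normalized) mean feature b-bar of class B. Asymmetric by design.
    b_bar = feats_b.mean(axis=0)
    b_bar = b_bar / np.linalg.norm(b_bar)
    A = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    return float(np.sum(A @ b_bar))

a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[1.0, 0.0]])
# Averaging B before comparing makes the measure asymmetric:
assert abs(class_similarity(a, b) - class_similarity(b, a)) > 0.1
```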

Figure A.4: Empirical evaluation of the complexity of the learned feature mapping during training under the Gaussian mixture setting. Two complexity measurements are considered, i.e., C(f) as in (3.3) and the SNE loss as in (2.2). The SNE loss here only serves as an indicator of how well the pairwise distances are preserved. The training objective is the standard InfoNCE loss. The SNE loss decreases quickly in the first 100 iterations and stays flat afterwards.


Figure A.5: Empirical evaluation of the complexity of the learned feature mapping during training on CIFAR-10. Two complexity measurements are considered, i.e., C(f) as in (3.3) and the l2-norm. Specifically, we calculate the expected Lipschitz constant on both the representation layer (512-dimensional) and the projection layer (128-dimensional). Figures (a) and (b) show the trends (along the 200 training epochs) for SimCLR and t-SimCLR respectively.

Figure C.6: Nearest neighbor test accuracy vs. training epochs. SimCLR and t-SimCLR share similar trends and convergence speed.

Figure C.7: The histogram of IoUs for 1000 constructed positive pairs in CIFAR-10. The empirical distribution is almost symmetric around 0.5.
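For reference, the IoU between two rectangular crops, as used for the histogram above and for the weighting scheme of Appendix C.1, can be computed as follows; a minimal sketch with a hypothetical (x1, y1, x2, y2) rectangle convention:

```python
def crop_iou(r1, r2):
    # IoU of two axis-aligned crop rectangles given as (x1, y1, x2, y2).
    ix1, iy1 = max(r1[0], r2[0]), max(r1[1], r2[1])
    ix2, iy2 = min(r1[2], r2[2]), min(r1[3], r2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (r1[2] - r1[0]) * (r1[3] - r1[1])
    area2 = (r2[2] - r2[0]) * (r2[3] - r2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

# Two half-overlapping unit squares: intersection 0.5, union 1.5 -> IoU = 1/3
assert abs(crop_iou((0, 0, 1, 1), (0.5, 0, 1.5, 1)) - 1 / 3) < 1e-9
```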

Figure C.8: Extension on Figure 2(b). Nearest neighbor classification accuracy for SimCLR vs. t-SimCLR on both CIFAR-10 (in-distribution) and CIFAR-100 (out-of-distribution) using different feature dimensions.

Figure C.7 illustrates the IoU histogram of 1000 constructed positive pairs.

C.3 DEGREE OF FREEDOM IN t-SIMCLR

Feature dimension efficiency in the OOD case. To further investigate the generalization ability of SSCL methods, we devise a challenging setting where the model is trained on CIFAR-10 and tested on CIFAR-100 classification. In this case, we evaluate the effect of increasing the feature dimension of the projection layer, as an extension of the CIFAR-10 in-distribution case. The results are shown in Figure C.8, where there are two things to note.

Figure C.9: Nearest neighbor classification accuracy on CIFAR-10 for t-SimCLR using different feature dimensions and different degrees of freedom (t_df).

See also Böhm et al. (2022) and Damrich et al. (2022), which focused specifically on data visualization and stochastic neighbor embedding.

Figure C.10: 50K CIFAR-10 training images visualization in 2D with t-SNE.

Figure C.11: 50K CIFAR-10 training images visualization in 2D with the default t-SimCLR.

Figure C.12: 50K CIFAR-10 training images visualization in 2D with SimCLR.

The overall training objective for SNE is the sum of KL divergences between the input- and embedding-space conditionals, $\sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j P_{j|i} \log \frac{P_{j|i}}{Q_{j|i}}$.

Table C.3: Domain transfer results of vanilla MoCo-v2 and t-MoCo-v2.

Table C.4: OOD accuracies of vanilla MoCo-v2 and t-MoCo-v2 on domain generalization benchmarks. Both models are pre-trained for 200 epochs on ImageNet following the setting of Chen et al. (2020b).

The linear probing accuracy of t-MoCo-v2 on ImageNet is 67.0%, which is comparable to the MoCo-v2 result of 67.5%. With the same level of in-distribution classification accuracy, we conduct extensive experiments to compare their OOD performance. The results in Tables 1 and 2 suggest that our modification significantly improves the domain transfer and OOD generalization ability without sacrificing in-distribution accuracy.

We also run the SimCLR and t-SimCLR experiments in the more standard 1000-epoch setting. For SimCLR, we use a batch size of 512, a learning rate of 0.3, a temperature of 0.7, and a weight decay of 0.0001. For t-SimCLR, we use a batch size of 512, a learning rate of 0.8, a temperature of 10, a weight decay of 0.0002, and $t_{df} = 5$. The nearest neighbor accuracy is 87.2% for SimCLR vs. 88.8% for t-SimCLR.

Continuing the OOD generalization protocol of Appendix C.6: we choose one domain as the test domain and use the remaining domains as training domains, which is named the leave-one-domain-out protocol. The top linear classifier is trained on the training domains and tested on the test domain. Each domain rotates as the test domain, and the average accuracy is reported for each dataset in Table C.4. On each dataset, we perform a hyperparameter search following DomainBed (Gulrajani & Lopez-Paz, 2021). We adopt the leave-one-domain-out cross-validation setup in DomainBed with 10 experiments for hyperparameter selection and run 3 trials. As can be seen in Table C.4, our t-MoCo-v2 with 200 epochs even significantly outperforms the baseline with 800 epochs on all three datasets.

