RETHINKING UNIFORMITY IN SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Self-supervised representation learning has achieved great success in many machine learning tasks. Many research efforts tend to learn better representations by preventing the model from collapse. Wang & Isola (2020) opened a new perspective by introducing a uniformity metric to measure the collapse degree of representations. However, we theoretically and empirically demonstrate that this metric is insensitive to dimensional collapse. Inspired by the finding that representations obeying a zero-mean isotropic Gaussian distribution achieve ideal uniformity, we propose to use the Wasserstein distance between the distribution of learned representations and the ideal distribution with maximum uniformity as a quantifiable metric of uniformity. To analyze the capacity to capture sensitivity to dimensional collapse, we design five desirable constraints for ideal uniformity metrics, based on which we find that the proposed uniformity metric satisfies all constraints while the existing one does not. Synthetic experiments also demonstrate that the proposed uniformity metric can distinguish different degrees of dimensional collapse while the existing one in (Wang & Isola, 2020) is insensitive. Finally, we impose the proposed uniformity metric as an auxiliary loss term on various existing self-supervised methods, which consistently improves downstream performance.

1. INTRODUCTION

Self-supervised representation learning has become increasingly popular in the machine learning community (Chen et al., 2020; He et al., 2020; Caron et al., 2020; Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021), and has achieved impressive results in various tasks such as object detection, segmentation, and text classification (Xie et al., 2021; Wang et al., 2021b; Yang et al., 2021; Zhao et al., 2021; Wang et al., 2021a; Gunel et al., 2021). Aiming to learn representations that are invariant under different augmentations, a common practice of self-supervised learning is to maximize the similarity of representations obtained from different augmented versions of a sample using a Siamese network (Bromley et al., 1994; Hadsell et al., 2006). However, a common issue with this approach is the existence of trivial constant solutions in which all representations collapse to a constant point (Chen & He, 2021), as visualized in Fig. 1, known as the collapse problem (Jing et al., 2022).

Figure 1: Illustration of constant collapse and dimensional collapse.

Many efforts have been made to prevent the vanilla Siamese network from the collapse problem. The well-known solutions can be summarized into three types: contrastive learning (Chen et al., 2020; He et al., 2020; Caron et al., 2020), asymmetric model architecture (Grill et al., 2020; Chen & He, 2021), and redundancy reduction (Zbontar et al., 2021; Zhang et al., 2022b). While these solutions avoid complete constant collapse, they may still suffer from dimensional collapse (Hua et al., 2021), in which representations occupy a lower-dimensional subspace instead of the entire available embedding space (Jing et al., 2022), as depicted in Fig. 1. Therefore, to show the effectiveness of the aforementioned approaches, we need a quantifiable metric of the collapse degree of learned representations. To gain a quantifiable analysis of the collapse degree, recent works (Arora et al., 2019; Wang & Isola, 2020) propose to divide the loss function into alignment and uniformity terms. For instance, objective functions such as InfoNCE (van den Oord et al., 2018) and the cross-correlation loss employed in Barlow Twins (Zbontar et al., 2021) can be divided into these two terms. Such uniformity terms explain the degree of collapse to some extent, since they measure the variability of learned representations (Zbontar et al., 2021). However, their calculation relies on the choice of anchor-positive pairs, making them hard to use as general metrics. Wang & Isola (2020) further propose a formal definition of a uniformity metric via the Radial Basis Function (RBF) kernel (Cohn & Kumar, 2007). Despite its usefulness (Gao et al., 2021; Zhou et al., 2022), we theoretically and empirically demonstrate that this metric is insensitive to dimensional collapse. In this paper, we focus on designing a new uniformity metric that is sensitive to dimensional collapse.
Towards this end, we first introduce an interesting finding: representations that obey a zero-mean isotropic Gaussian distribution achieve ideal uniformity. Based on this finding, we use the Wasserstein distance between the distribution of learned representations and this ideal distribution as the metric of uniformity. By checking five well-designed desirable properties (called 'desiderata') of uniformity, we theoretically demonstrate that the proposed uniformity metric satisfies all desiderata while the existing one (Wang & Isola, 2020) does not. Synthetic experiments also demonstrate that the proposed uniformity metric can quantitatively distinguish various degrees of dimensional collapse while the existing one is insensitive. Lastly, we apply our proposed uniformity metric in practical scenarios, namely, imposing it as an auxiliary loss term for various existing self-supervised methods, which consistently improves downstream performance. The contributions of this work are summarized as follows: (i) We theoretically and empirically demonstrate that the existing uniformity metric (Wang & Isola, 2020) is insensitive to dimensional collapse, and we propose a new uniformity metric that is sensitive to it; (ii) By designing five desirable properties, we open a new perspective to rethink ideal uniformity metrics; (iii) Our proposed uniformity metric can be applied as an auxiliary loss term in various self-supervised methods, consistently improving performance in downstream tasks.

2. BACKGROUND

2.1. SELF-SUPERVISED REPRESENTATION LEARNING

Self-supervised representation learning aims to learn representations that are invariant to a series of different augmentations. Towards this end, a common practice is to maximize the similarity of representations obtained from different augmented versions of a sample. Specifically, given a set of data samples {x_1, x_2, ..., x_n}, a symmetric network architecture, also called a Siamese network (Hadsell et al., 2006), takes as input two randomly augmented views x_i^a and x_i^b of an input sample x_i. The two views are then processed by an encoder network f consisting of a backbone (e.g., ResNet (He et al., 2016)) and a projection MLP head (Chen et al., 2020), denoted as g. To enforce invariance between the representations of the two views, z_i^a ≜ g(f(x_i^a)) and z_i^b ≜ g(f(x_i^b)), a natural solution is to maximize the cosine similarity between them, and the Mean Square Error (MSE) is a widely used loss function to align their l2-normalized representations on the surface of the unit hypersphere:

L_align^θ = ∥ z_i^a/∥z_i^a∥ − z_i^b/∥z_i^b∥ ∥_2^2 = 2 − 2 · ⟨z_i^a, z_i^b⟩ / (∥z_i^a∥ · ∥z_i^b∥).   (1)

However, this approach easily learns an undesired trivial solution in which all representations collapse to a constant, as depicted in Fig. 1.
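As a concrete illustration, the alignment loss in Equation 1 can be computed directly from two batches of view representations. The following is a minimal PyTorch sketch; the function and variable names are our own, not from the paper:

```python
import torch
import torch.nn.functional as F

def align_loss(za: torch.Tensor, zb: torch.Tensor) -> torch.Tensor:
    """MSE between l2-normalized views (Equation 1).

    Equals 2 - 2 * cosine_similarity, so it lies in [0, 4]."""
    za = F.normalize(za, dim=-1)  # project each row onto the unit hypersphere
    zb = F.normalize(zb, dim=-1)
    return (za - zb).pow(2).sum(dim=-1).mean()
```

Identical views yield a loss of 0, and diametrically opposite views yield the maximum value 4, consistent with the 2 − 2·cos form.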

2.2. EXISTING SOLUTIONS TO CONSTANT COLLAPSE

To prevent the Siamese network from constant collapse, existing well-known solutions can be summarized into three types: contrastive learning, asymmetric model architecture, and redundancy reduction. More details are explained in this section.

Contrastive Learning Contrastive learning is one effective way to avoid constant collapse; the core idea is to repulse negative pairs while attracting positive pairs. SimCLR (Chen et al., 2020) is one of the most representative works; it first proposes an in-batch negative trick that employs other samples in a batch as negative samples. However, its effectiveness heavily relies on a large batch size. To overcome this limitation, MoCo (He et al., 2020) proposes a memory bank to store more representations as negative samples. Besides instance-wise contrastive learning approaches, some recent works also propose clustering-based contrastive learning by combining a clustering objective with contrastive learning (Li et al., 2021; Caron et al., 2020).

Asymmetric Model Architecture An asymmetric model architecture is another approach to prevent constant collapse; the core idea is to break the symmetry of the Siamese network. A possible explanation is that asymmetric architectures encourage encoding more information (Grill et al., 2020). To keep asymmetry, BYOL (Grill et al., 2020) uses an extra predictor in one branch of the Siamese network, and a momentum update and stop-gradient operator in the other branch. An interesting work, DINO (Caron et al., 2021), applies this asymmetry to two encoders and distills knowledge from the momentum encoder to the other branch (Hinton et al., 2015). Chen et al. propose SimSiam (Chen & He, 2021) by removing the momentum update from BYOL; its success demonstrates that the momentum update is not the key to preventing collapse. Mirror-SimSiam (Zhang et al., 2022a) further swaps the stop-gradient operator to the other branch; its failure refutes the claim in SimSiam (Chen & He, 2021) that the stop-gradient operator is the key component preventing the model from collapse.

Redundancy Reduction

The principle of redundancy reduction for preventing constant collapse is to maximize the information content of the representations. The core idea is to achieve decorrelation by making a matrix computed from the representations as close to the identity matrix as possible. Barlow Twins (Zbontar et al., 2021) pursues this on the cross-correlation matrix, while VICReg (Bardes et al., 2022) does so on the covariance matrix. Instead of regularizing the matrix, W-MSE (Ermolov et al., 2021) directly makes the covariance matrix equal to the identity matrix via feature-wise whitening. Zero-CL (Zhang et al., 2022b) further proposes a hybrid of instance-wise and feature-wise whitening to achieve this end.

2.3. COLLAPSE ANALYSIS

While the aforementioned solutions effectively prevent the model from constant collapse, they may still suffer from dimensional collapse, in which representations occupy a lower-dimensional subspace instead of the entire available embedding space, as depicted in Fig. 1. Evidence of dimensional collapse in contrastive learning was identified via the singular value spectrum of representations (Jing et al., 2022). However, the singular value spectrum is presented as plots, making it hard to conduct statistical comparisons among various approaches in terms of collapse analysis. To gain a quantifiable analysis of the collapse degree, Wang & Isola (2020) propose a formal definition of a uniformity metric via the Radial Basis Function (RBF) kernel (Cohn & Kumar, 2007). More specifically, given a set of representation vectors {z_1, z_2, ..., z_n} (z_i ∈ R^m), the uniformity metric is defined as:

L_U ≜ log ( 1/(n(n−1)/2) · Σ_{i=2}^{n} Σ_{j=1}^{i−1} e^{−t ∥ z_i/∥z_i∥ − z_j/∥z_j∥ ∥_2^2} ), t > 0,   (2)

where t is a fixed parameter (generally t = 2). Despite its usefulness, we theoretically and empirically demonstrate that this metric is insensitive to dimensional collapse in Sec. 4.2 and Sec. 4.3.
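For reference, the metric in Equation 2 can be computed in a few lines. Below is a minimal PyTorch sketch (the function name is our own); `torch.pdist` enumerates exactly the n(n−1)/2 pairwise distances that appear in the double sum:

```python
import torch
import torch.nn.functional as F

def uniformity_lu(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Uniformity metric L_U of Wang & Isola (2020), Equation 2:
    log of the mean pairwise Gaussian (RBF) potential on the hypersphere."""
    z = F.normalize(z, dim=-1)                # l2-normalize each representation
    sq_dists = torch.pdist(z, p=2).pow(2)     # all n(n-1)/2 squared distances
    return sq_dists.mul(-t).exp().mean().log()
```

When all representations coincide, every pairwise distance is zero and L_U reaches its maximum of 0; more spread-out representations give more negative values.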

3. A NEW UNIFORMITY METRIC

In this section, we focus on designing a new uniformity metric that is sensitive to dimensional collapse. In Sec. 3.1, we introduce an interesting finding: maximum uniformity is achieved when the learned representations obey a zero-mean isotropic Gaussian distribution. To enforce the uniqueness of the ideal distribution, we adopt its l2-normalized form; interestingly, we theoretically and empirically demonstrate in Sec. 3.2 that it is approximately a Gaussian distribution. Based on this principle, in Sec. 3.3 we propose to use the Wasserstein distance between the distribution of learned representations and this ideal distribution as a uniformity metric.

3.1. ZERO-MEAN ISOTROPIC GAUSSIAN DISTRIBUTION, THE MAXIMUM UNIFORMITY

As shown in Theorem 1, maximum uniformity is achieved if the learned representations obey a zero-mean isotropic Gaussian distribution (Z ∼ N(0, σ²I)).

Theorem 1. Let Z ∼ N(0, σ²I_m) (Z ∈ R^m). Then its l2-normalized form Y = Z/∥Z∥_2 is uniformly distributed on the surface of the unit hypersphere S^{m−1}.

See App. A for the proof. However, obeying a zero-mean isotropic Gaussian distribution is a sufficient but not necessary condition for ideal uniformity of the l2-normalized form. For example, as stated in Corollary 1, a mixture of two independent zero-mean isotropic Gaussian random variables also achieves ideal uniformity.

Corollary 1. Let Z_1, Z_2 ∈ R^m both follow Gaussian distributions, Z_1 ∼ N(0, σ_1²I_m) and Z_2 ∼ N(0, σ_2²I_m), and let Z be a mixture distribution¹ derived from Z_1 and Z_2 with any binary selection probabilities. Then its l2-normalized form Y = Z/∥Z∥_2 is also uniformly distributed on the surface of the unit hypersphere S^{m−1}.

The distribution achieving ideal uniformity (i.e., a mixture of zero-mean isotropic Gaussian distributions) is thus not unique due to the mixture form discussed in Corollary 1: one can define different mixtures of zero-mean isotropic Gaussian distributions, each with different norms encapsulated in σ_1, ..., σ_k. Therefore, we turn to investigate the l2-normalized form of these zero-mean isotropic Gaussian distributions², see Sec. 3.2.

3.2. ON THE l2-NORMALIZED GAUSSIAN DISTRIBUTION

This section discusses the characteristics of the l2-normalized distribution of a Gaussian distribution mixture, which turns out to be close to a Gaussian distribution N(0, (1/m)I_m), from both a theoretical aspect (Sec. 3.2.1) and an empirical aspect (Sec. 3.2.2).

3.2.1. THEORETICAL CONNECTION BETWEEN Y AND THE GAUSSIAN DISTRIBUTION

For simplicity, we denote the l2-normalized form of a zero-mean isotropic Gaussian distribution as Y = Z/∥Z∥_2. Note that Y obeys a uniform distribution on the surface of the unit hypersphere, Y ∼ U(S^{m−1}). Interestingly, we find that Y approximates a Gaussian distribution N(0, (1/m)I_m) when m is large enough. In particular, each dimension of Y, denoted Y_i, converges to a Gaussian distribution N(0, 1/m) in terms of the Kullback-Leibler divergence as m grows, see Theorem 2.

Theorem 2. Let Y_i be the random variable in the i-th dimension of Y = Z/∥Z∥_2, where Z ∼ N(0, σ²I_m) (Z ∈ R^m). Then the Kullback-Leibler divergence between Y_i and Ŷ_i ∼ N(0, 1/m) converges to zero as m → ∞:

lim_{m→∞} D_KL(Ŷ_i ∥ Y_i) = 0.

We first derive the probability density function (pdf) of Y_i, as shown in App. C. Since the probability density functions of both distributions are known, we can derive the Kullback-Leibler divergence between them; one trick is to expand a logarithm term using a Taylor expansion. Finally, we obtain that the divergence has a limit of zero as m approaches infinity, proving Theorem 2. See App. D for the detailed proof.

¹A mixture distribution is the probability distribution of a random variable derived from a collection of other random variables. It can be realized by first selecting one of the random variables according to given selection probabilities, and then sampling a value from the selected random variable.

²Most recent self-supervised representation learning approaches learn representations with an l2-norm constraint (Zbontar et al., 2021; Wang & Isola, 2020; Chen & He, 2021; Grill et al., 2020; Chen et al., 2020), restricting the output representations to the surface of the unit hypersphere, i.e., the l2-normalized representation Y ≜ Z/∥Z∥_2 lies on the surface of the unit hypersphere S^{m−1}.
This suggests that the directions of learned representation vectors (rather than the absolute amplitudes of their elements) matter when capturing the semantic information of instances. Therefore, we make the assumption that the l2-normalized zero-mean isotropic Gaussian distribution (denoted Y) follows, or at least is close to, the Gaussian distribution N(0, (1/m)I_m), even when m is only moderately large. Note that N(0, (1/m)I_m) enjoys the merit of uniqueness (a proper distribution for designing a uniformity metric), and can be used as an approximating distribution for Y.

Figure 2: Binned density of Y_i compared with Ŷ_i ∼ N(0, 1/m) for (a) m = 2, (b) m = 4, (c) m = 8, (d) m = 16, (e) m = 32, (f) m = 64, (g) m = 128, (h) m = 256.

3.2.2. EMPIRICAL CONNECTION BETWEEN Y AND THE GAUSSIAN DISTRIBUTION

The above theoretical conclusion states that the distribution of Y becomes arbitrarily close to a Gaussian distribution as m grows infinitely large; in practice, however, we have to adopt a finite m due to memory limits. Here we empirically check the closeness between Y and a Gaussian distribution for manageable dimensions m. Without loss of generality, we analyze an arbitrary dimension of Y. The distribution of the i-th dimension of Y, denoted as the random variable Y_i, is visualized in Fig. 2 by binning 200,000 sampled data points (called 'samples' later) into 51 groups. Fig. 2 shows the difference between the distributions of Y_i and Ŷ_i for m selected from a manageable range {2, 4, 8, 16, 32, 64, 128, 256}. Note that the difference becomes negligible when m is moderately large (e.g., m > 32). To quantitatively measure the closeness, Fig. 3 shows the distance (e.g., the Wasserstein distance as defined in App. G) between Y_i and Ŷ_i as m increases. One can observe that the distance converges to zero for large m, which also empirically supports the conclusion of Theorem 2. See App. H for more details.
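The empirical trend above can be reproduced with a short numerical sketch. Names and sample sizes here are our own choices; we use the fact that the exact 1-Wasserstein distance between two equal-size empirical samples reduces to the mean absolute difference of the sorted samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def w1(a: np.ndarray, b: np.ndarray) -> float:
    """Exact 1-Wasserstein distance between two equal-size empirical samples."""
    return float(np.abs(np.sort(a) - np.sort(b)).mean())

def coord_vs_gaussian(m: int, n: int = 20_000) -> float:
    """W1 distance between one coordinate Y_i of Y = Z/||Z||_2 (Z ~ N(0, I_m))
    and an equal-size sample from the candidate limit N(0, 1/m)."""
    z = rng.standard_normal((n, m))
    y_i = (z / np.linalg.norm(z, axis=1, keepdims=True))[:, 0]
    y_hat = rng.standard_normal(n) / np.sqrt(m)
    return w1(y_i, y_hat)

# the distance shrinks as the dimension m grows, mirroring Fig. 3
dists = {m: coord_vs_gaussian(m) for m in (2, 8, 32, 256)}
```

For small m (e.g., m = 2) the coordinate distribution is visibly non-Gaussian and the distance is large; by m = 256 it is close to sampling noise.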

3.3. A NEW METRIC FOR UNIFORMITY

In this section, we propose to use the distance between the distribution of learned representations and the ideal Gaussian distribution N(0, (1/m)I_m) as a uniformity metric. Specifically, we collect a set of learned representation vectors {z_1, z_2, ..., z_n} and use their l2-normalized forms to calculate the mean and covariance matrix:

µ = (1/n) Σ_{i=1}^{n} z_i/∥z_i∥, Σ = (1/n) Σ_{i=1}^{n} (z_i/∥z_i∥ − µ)(z_i/∥z_i∥ − µ)^T,   (3)

where µ ∈ R^m, Σ ∈ R^{m×m}, and m is the dimension of the vectors. To facilitate the calculation of the distribution distance, we apply a Gaussian hypothesis to the learned representations, N(µ, Σ). Based on this assumption, we employ the Wasserstein distance, a well-known distribution distance that equals the minimum cost of turning one pile into the other when viewing each distribution as a unit amount of earth/soil (see the definition in App. G), to calculate the distance between the two distributions.

Theorem 3 (Wasserstein Distance (Olkin & Pukelsheim, 1982)). Suppose two random variables Z_1 ∼ N(µ_1, Σ_1) and Z_2 ∼ N(µ_2, Σ_2) obey multivariate normal distributions. Then the l2-Wasserstein distance between Z_1 and Z_2 is:

W_2(Z_1, Z_2) = ( ∥µ_1 − µ_2∥_2^2 + Tr(Σ_1 + Σ_2 − 2(Σ_2^{1/2} Σ_1 Σ_2^{1/2})^{1/2}) )^{1/2}.   (4)

Despite its apparent complexity, the Wasserstein distance between Gaussian distributions is easy to calculate, as illustrated in Theorem 3. We instantiate Equation 4 with the distribution of learned representations and the ideal distribution. A uniformity metric via the Wasserstein distance can then be formulated as:

W_2 ≜ ( ∥µ∥_2^2 + 1 + Tr(Σ) − (2/√m) Tr(Σ^{1/2}) )^{1/2}.   (5)

A smaller W_2 indicates larger uniformity of the representations. Besides its usefulness for collapse analysis, our proposed uniformity metric can also be used as an additional loss for various existing self-supervised methods, since it is differentiable during the backward pass. One difference is that the mean and covariance matrix in Equation 3 are then calculated over batch data during the training phase.
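Equations 3 and 5 translate directly into code. The sketch below uses our own naming and takes the matrix square root via an eigendecomposition of the symmetric covariance matrix (so that Tr(Σ^{1/2}) is the sum of the square roots of the eigenvalues):

```python
import torch
import torch.nn.functional as F

def uniformity_w2(z: torch.Tensor) -> torch.Tensor:
    """W_2 uniformity (Equation 5): 2-Wasserstein distance between a Gaussian
    fit to the l2-normalized representations and the ideal N(0, I_m / m)."""
    n, m = z.shape
    y = F.normalize(z, dim=-1)
    mu = y.mean(dim=0)                         # Equation 3, mean
    yc = y - mu
    sigma = yc.T @ yc / n                      # Equation 3, covariance
    # Tr(Sigma^{1/2}) = sum of sqrt of eigenvalues of the symmetric covariance
    eigvals = torch.linalg.eigvalsh(sigma).clamp(min=0)
    w2_sq = mu.pow(2).sum() + 1 + sigma.trace() - 2 / m**0.5 * eigvals.sqrt().sum()
    return w2_sq.clamp(min=0).sqrt()
```

For representations sampled from N(0, I_m) the metric is close to zero, while a fully collapsed batch (all vectors identical) gives W_2 = √2, since then ∥µ∥² = 1 and Σ = 0.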

4. ON UNIFORMITY METRICS

In this section, we first introduce the desirable properties (called 'desiderata') of any well-defined uniformity metric in Sec. 4.1. Sec. 4.2 and Sec. 4.3 then compare the proposed uniformity metric −W_2 with the existing uniformity metric −L_U, theoretically and empirically, respectively.

4.1. DESIDERATA OF UNIFORMITY

A uniformity metric is a function mapping a set of learned representations (typically dense vectors) to a uniformity indicator (typically a real number): U : {R^m}^n → R, where D ∈ {R^m}^n is a set of learned vectors (D = {z_1, z_2, ..., z_n}) and each vector z_i ∈ R^m is the feature representation of an instance. In this section, we formally define five desiderata (i.e., desirable properties) for any uniformity metric.

Intuitively, uniformity is invariant to the permutation of instances, since permutation cannot affect the distribution.

Property 1. Instance Permutation Constraint (IPC): U(π(D)) = U(D), where π is an instance permutation operator that changes the order of the representations.

Uniformity should be invariant when all representations are re-scaled, since modern machine learning tends to use the directions of learned representation vectors to capture the semantic information of instances. For example, most recent self-supervised representation learning approaches learn representations with an l2-norm constraint (Zbontar et al., 2021; Wang & Isola, 2020; Grill et al., 2020; Chen et al., 2020), restricting the output representations to the surface of the unit hypersphere, i.e., D_s = {s_1, s_2, ..., s_n} with s_i = z_i/∥z_i∥_2 on the surface of the unit hypersphere S^{m−1}.

Property 2. Instance Scaling Constraint (ISC): U({λ_1 z_1, λ_2 z_2, ..., λ_n z_n}) = U(D), ∀λ_i ∈ R^+.

Uniformity is invariant when instances are cloned, since the cloning operator does not change the original distribution density.

Property 3. Instance Cloning Constraint (ICC): U(D ∪ D) = U(D), where ∪ is the union of two sets that achieves instance cloning, D ∪ D = {z_1, ..., z_n, z_1, ..., z_n}.

Uniformity decreases when cloning the features of each instance, since a feature-level clone introduces redundancy, leading to dimensional collapse (Zbontar et al., 2021; Bardes et al., 2022).

Property 4. Feature Cloning Constraint (FCC): U(D ⊕ D) ≤ U(D), where ⊕ is a feature-level concatenation operator that achieves feature cloning, D ⊕ D = {z_1 ⊕ z_1, z_2 ⊕ z_2, ..., z_n ⊕ z_n} with z_i ⊕ z_i = [z_{i1}, ..., z_{im}, z_{i1}, ..., z_{im}]^T ∈ R^{2m}. Equality U(D ⊕ D) = U(D) holds if and only if z_1 = z_2 = ... = z_n = 0_m.

Uniformity decreases when adding constant features to each instance, since this introduces uninformative features and results in collapsed dimensions.

Property 5. Feature Baby Constraint (FBC): U(D ⊕ 0_k) ≤ U(D), k ∈ N^+, where D ⊕ 0_k = {z_1 ⊕ 0_k, z_2 ⊕ 0_k, ..., z_n ⊕ 0_k} and z_i ⊕ 0_k = [z_{i1}, z_{i2}, ..., z_{im}, 0, 0, ..., 0]^T ∈ R^{m+k}. Equality U(D ⊕ 0_k) = U(D) holds if and only if z_1 = z_2 = ... = z_n = 0_m.

Note that these five properties are necessary but not sufficient for a well-designed uniformity metric. That is, a well-designed uniformity metric should satisfy these properties, while satisfying them alone does not guarantee an ideal uniformity metric.

4.2. EXAMINING DESIDERATA OF UNIFORMITY

We employ the desiderata in Sec. 4.1 as criteria to conduct a theoretical analysis of the two metrics, −L_U in Equation 2 and −W_2 in Equation 5. The conclusions are stated in Claim 1 and Claim 2.

Claim 1. Our proposed metric −W_2 satisfies all properties, i.e., Properties 1, 2, 3, 4, and 5. See App. E.1 for the detailed proof.

Claim 2. The baseline metric −L_U satisfies Properties 1 and 2, but violates Properties 3, 4, and 5. See App. E.2 for the detailed proof.

For the IPC and ISC properties, the definitions directly show that both metrics satisfy them. For the other three properties, see App. E for the detailed proofs. In particular, the proposed metric −W_2 satisfies the FBC property while the baseline metric −L_U does not. This opens a new angle to explain the advantage of our proposed metric −W_2 from the dimensional collapse perspective. Specifically, a larger k induces more severe dimensional collapse in D ⊕ 0_k than in D. However, −L_U fails to identify this more severe dimensional collapse, since −L_U(D ⊕ 0_k) = −L_U(D). On the contrary, our proposed metric is sensitive to dimensional collapse, as −W_2(D ⊕ 0_k) < −W_2(D).

Correlation between L_U and W_2. We employ synthetic experiments to study the uniformity metrics. In detail, we manually sample 50,000 data vectors from different distributions, such as the standard Gaussian distribution N(0, I), the uniform distribution U(0, 1), mixtures of Gaussians, etc.

4.3. EMPIRICAL ANALYSIS

Based on these data vectors, we estimate the uniformity of the different distributions using the two metrics. As shown in Fig. 4, the standard Gaussian distribution achieves the minimum value under both W_2 and L_U, indicating that it achieves larger uniformity than the other distributions. This empirical result is consistent with Theorem 1, which states that the standard Gaussian distribution achieves maximum uniformity.

On the Dimensional Collapse. To synthesize data with various specified degrees of dimensional collapse, we concatenate zero vectors (i.e., fully collapsed dimensions) with data vectors sampled from the standard Gaussian distribution (i.e., ideal uniformity without collapse). The percentage of zero-valued dimensions in the concatenated vectors is η, while that of non-zero dimensions is 1 − η. The results are shown in Fig. 5. Interestingly, as visualized in Fig. 6, L_U becomes indistinguishable across different degrees of dimensional collapse (η = 25%, 50%, and 75%) when the dimension m becomes large (e.g., m ≥ 2^8). On the contrary, our proposed W_2 is invariant to the dimension number under a given degree of dimensional collapse: W_2 depends only on the degree of dimensional collapse and is independent of the dimension number. In summary, our proposed metric W_2 is empirically a more reasonable uniformity metric than the existing metric L_U.
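A minimal version of this synthetic dimensional-collapse experiment can be sketched as follows. Sample sizes and names are our own choices, and both metrics are re-implemented compactly so the sketch is self-contained:

```python
import torch
import torch.nn.functional as F

def lu(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Baseline uniformity L_U (Equation 2)."""
    z = F.normalize(z, dim=-1)
    return torch.pdist(z).pow(2).mul(-t).exp().mean().log()

def w2(z: torch.Tensor) -> torch.Tensor:
    """Proposed uniformity W_2 (Equation 5)."""
    n, m = z.shape
    y = F.normalize(z, dim=-1)
    mu = y.mean(dim=0)
    yc = y - mu
    sigma = yc.T @ yc / n
    lam = torch.linalg.eigvalsh(sigma).clamp(min=0)
    return (mu.pow(2).sum() + 1 + sigma.trace()
            - 2 / m**0.5 * lam.sqrt().sum()).clamp(min=0).sqrt()

def collapsed_batch(n: int, m: int, eta: float, gen: torch.Generator) -> torch.Tensor:
    """Gaussian samples with a fraction eta of dimensions forced to zero."""
    k = int(m * eta)
    z = torch.randn(n, m - k, generator=gen)
    return torch.cat([z, torch.zeros(n, k)], dim=1)

gen = torch.Generator().manual_seed(0)
for eta in (0.0, 0.25, 0.5, 0.75):
    z = collapsed_batch(4096, 256, eta, gen)
    print(f"eta={eta:.2f}  -L_U={-lu(z):.3f}  -W2={-w2(z):.3f}")
```

At m = 256, −L_U barely moves across collapse degrees, while −W_2 drops markedly as η grows, mirroring the behavior described for Fig. 6.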

5. EXPERIMENTS

In this section, we impose the proposed uniformity metric as an auxiliary loss term on various existing self-supervised methods, and conduct experiments on the CIFAR-10 and CIFAR-100 datasets to demonstrate its effectiveness. Code implemented in PyTorch will be released.

Models

We conduct experiments on a series of self-supervised representation learning models: (i) AlignUniform (Wang & Isola, 2020), whose loss objective consists of an alignment objective and a uniformity objective; (ii) three contrastive methods, i.e., SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and NNCLR (Dwibedi et al., 2021); (iii) two asymmetric models, i.e., BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2021); and (iv) two methods via redundancy reduction, i.e., BarlowTwins (Zbontar et al., 2021) and Zero-CL (Zhang et al., 2022b). To study the behavior of the proposed Wasserstein distance in self-supervised representation learning, we impose it as an auxiliary loss term on the following models: MoCo v2, BYOL, BarlowTwins, and Zero-CL. To facilitate better use of the Wasserstein distance, we also design a linear decay for its weight during the training phase, i.e., α_t = α_max − t · (α_max − α_min)/T, where t, T, α_max, α_min, and α_t are the current epoch, maximum number of epochs, maximum weight, minimum weight, and current weight, respectively. See App. J for detailed experimental settings.

Metrics We evaluate the above methods from two perspectives: one is linear evaluation accuracy, measured by Top-1 accuracy (Acc@1) and Top-5 accuracy (Acc@5); the other is representation capacity. According to (Arora et al., 2019; Wang & Isola, 2020), alignment and uniformity are the two most important properties for evaluating self-supervised representation learning. We use the two metrics L_U and W_2 to measure uniformity, and a metric A to measure the alignment between positive pairs (Wang & Isola, 2020). See App. K for more details about the alignment metric.

Main Results As shown in Tab. 1, imposing W_2 as an additional loss consistently improves performance over the corresponding baselines. Interestingly, although it slightly harms alignment, it usually improves uniformity and ultimately leads to better accuracy. This demonstrates the effectiveness of W_2 as a uniformity metric. Note that imposing the additional loss does not noticeably affect training or inference efficiency; therefore, adding W_2 as a loss is beneficial without tangible costs.
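The linear decay schedule for the auxiliary-loss weight described above is straightforward; below is a one-function sketch with our own argument names:

```python
def decay_weight(t: int, T: int, a_max: float, a_min: float) -> float:
    """Linearly decay the auxiliary-loss weight from a_max (epoch 0)
    to a_min (epoch T): alpha_t = a_max - t * (a_max - a_min) / T."""
    return a_max - t * (a_max - a_min) / T
```

At t = 0 the weight equals a_max, and it reaches a_min at the final epoch t = T.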

Convergence Analysis

We test the Top-1 accuracy of these models on CIFAR-10 and CIFAR-100 via the linear evaluation protocol (described in App. J) when training them for different numbers of epochs, as shown in Fig. 9 in App. L. By imposing W_2 as an additional loss, these models converge faster than the raw models, especially MoCo v2 and BYOL, which suffer from a serious collapse problem. Our experiments show that imposing the proposed uniformity metric as an auxiliary penalty loss largely improves uniformity while slightly damaging alignment; see more representation analysis in App. M.

Dimensional Collapse Analysis

To gain a better understanding of how the additional loss W_2 helps alleviate dimensional collapse, we visualize the singular value spectrum of the representations (Jing et al., 2022). As shown in Fig. 7, the spectrum contains the singular values of the covariance matrix of representations from the CIFAR-100 dataset in sorted order and logarithmic scale. Most singular values collapse to zero for the BYOL and MoCo v2 models (but not BarlowTwins), indicating a large number of collapsed dimensions in both models. By imposing W_2 as an additional loss on these two models, the number of collapsed dimensions decreases almost to zero, indicating that W_2 can effectively address dimensional collapse.

Figure 7: Dimensional collapse analysis on the CIFAR-100 dataset (singular value spectra for (a) MoCo v2, (b) BYOL, and (c) BarlowTwins, with and without the W_2 loss).
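The singular value spectrum used in this analysis can be computed as follows; this is a sketch with our own naming, following the recipe of Jing et al. (2022): singular values of the covariance matrix of the representations, in sorted order and on a log scale:

```python
import torch

def singular_value_spectrum(z: torch.Tensor) -> torch.Tensor:
    """Sorted log10 singular values of the covariance of representations,
    used to diagnose dimensional collapse: collapsed dimensions show up
    as singular values dropping to (numerical) zero."""
    zc = z - z.mean(dim=0)                 # center the representations
    cov = zc.T @ zc / z.shape[0]           # m x m covariance matrix
    svals = torch.linalg.svdvals(cov)      # returned in descending order
    return svals.clamp(min=1e-12).log10()  # floor avoids log10(0)
```

On a batch where half the dimensions are identically zero, the trailing half of the spectrum drops to the numerical floor while the leading half stays near zero on the log scale.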

6. CONCLUSION

In this paper, we theoretically and empirically demonstrate that the existing uniformity metric is insensitive to dimensional collapse, and we design a new uniformity metric that is sensitive to it. To this end, we propose to use the Wasserstein distance between the distribution of learned representations and the ideal distribution as the metric of uniformity. Furthermore, we formulate five desirable constraints (desiderata) for ideal uniformity metrics, based on which we find that the proposed uniformity metric satisfies all desiderata while the existing one does not. Moreover, we conduct synthetic experiments to further demonstrate that the proposed uniformity metric can capture dimensional collapse while the existing one is insensitive. Finally, we apply our proposed metric in practical scenarios, imposing it as an auxiliary loss term for various existing self-supervised methods, which consistently improves downstream performance. One limitation of our work is that the five desirable constraints (desiderata) are necessary but not sufficient for an ideal uniformity metric. In future work, we will seek further reasonable properties for uniformity metrics.

A PROOF OF THEOREM 1

Proof. By the properties of the Gaussian distribution, the distribution of $Z \sim \mathcal{N}(0, \sigma^2 I_m)$ is invariant to an arbitrary orthogonal transformation $U$:
$$\hat{Z} = UZ \sim \mathcal{N}(U0, \sigma^2 U I_m U^T) = \mathcal{N}(0, \sigma^2 I_m) \qquad (U0 = 0,\; U I_m U^T = U U^T = I_m).$$
Therefore $\hat{Z}$ is identically distributed with the random variable $Z$; we denote this by $\hat{Z} \overset{d}{=} Z$. For the $\ell_2$-normalized variables $Y = Z/\|Z\|_2$ and $\hat{Y} = \hat{Z}/\|\hat{Z}\|_2$, it follows that $\hat{Y} \overset{d}{=} Y$. Since
$$\|UZ\|_2 = \sqrt{(UZ)^T (UZ)} = \sqrt{Z^T U^T U Z} = \sqrt{Z^T Z} = \|Z\|_2,$$
we have
$$\hat{Y} = \frac{UZ}{\|UZ\|_2} = \frac{UZ}{\|Z\|_2} = UY.$$
Therefore $Y \overset{d}{=} UY$ after an arbitrary orthogonal transformation $U$.

To conclude that $Y$ is uniformly distributed on the surface of the unit hypersphere $S^{m-1} = \{y \in \mathbb{R}^m : \|y\|_2 = 1\}$, we argue by contradiction. Assume the opposite: $Y$ is not uniformly distributed on $S^{m-1}$, i.e., its continuous density $\rho$ is not identical over equal-sized areas of $S^{m-1}$. Then there exist $r_1, r_2 \in S^{m-1}$ with $r_1 \neq r_2$ and $\rho(r_1) > \rho(r_2)$, and by continuity there exists a radius $\epsilon > 0$ (for the $\ell_2$-norm; the same holds for other norms) such that on $D_1 = \{r \in S^{m-1} : \|r - r_1\|_2 < \epsilon\}$ and $D_2 = \{r \in S^{m-1} : \|r - r_2\|_2 < \epsilon\}$ we still have $\rho(r) > \rho(s)$ for all $r \in D_1$ and $s \in D_2$. Therefore $P(D_1) > P(D_2)$. However, since $Y \overset{d}{=} UY$, $D_2$ can be obtained from $D_1$ by an orthogonal transformation (see footnote 1), which implies $P(D_1) = P(D_2)$, a contradiction. Hence $\rho(r_1) = \rho(r_2)$ for all $r_1, r_2 \in S^{m-1}$ with $r_1 \neq r_2$, and $Y = Z/\|Z\|_2$ is uniformly distributed on the hypersphere $S^{m-1}$.
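Theorem 1 (together with the moments derived in App. B) can be sanity-checked by simulation: if $Y$ is uniform on $S^{m-1}$, the projection $\langle Y, v\rangle$ has the same distribution for every unit vector $v$, the mean is $0$, and the covariance is $I_m/m$. A minimal NumPy sketch ($m$, sample size, and seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 32, 200_000
Z = rng.normal(size=(n, m))                       # Z ~ N(0, I_m)
Y = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # Y = Z / ||Z||_2

# Rotation invariance: <Y, v> should be identically distributed for any
# unit vector v; compare empirical quantiles along two arbitrary directions.
v1 = np.eye(m)[0]
v2 = rng.normal(size=m)
v2 /= np.linalg.norm(v2)
q = np.linspace(0.05, 0.95, 19)
gap = np.max(np.abs(np.quantile(Y @ v1, q) - np.quantile(Y @ v2, q)))

# Moments of the uniform distribution on the sphere: mean 0, covariance I/m.
mu = Y.mean(axis=0)
Sigma = np.cov(Y, rowvar=False)
```

Up to sampling noise, `gap` is near zero, `mu` is near the zero vector, and `Sigma` is near $I_m/m$, matching Theorems 1 and 4.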

B MEAN AND COVARIANCE MATRIX OF Y

Theorem 4. For a random variable $Z \sim \mathcal{N}(0, \sigma^2 I_m)$, $Z \in \mathbb{R}^m$, the $\ell_2$-normalized form $Y = Z/\|Z\|_2$ has mean and covariance matrix
$$\mu = 0, \qquad \Sigma = \frac{1}{m} I_m.$$

Proof. For $Z = [z_1, z_2, \cdots, z_m] \sim \mathcal{N}(0, \sigma^2 I_m)$, the probability density function (pdf) can be written as
$$f_Z(z) = \frac{1}{(2\pi)^{m/2} |\sigma^2 I_m|^{1/2}} \exp\Big\{-\frac{1}{2} z^T (\sigma^2 I_m)^{-1} z\Big\} = \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^m z_i^2\Big\}.$$
Denote $Y = [y_1, y_2, \cdots, y_m]$. The mean of the $i$-th coordinate $y_i$ is
$$\mathbb{E}[y_i] = \int \cdots \int \frac{z_i}{\sqrt{\sum_{j=1}^m z_j^2}} \, \frac{1}{(2\pi\sigma^2)^{m/2}} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{j=1}^m z_j^2\Big\} \, dz_1 \cdots dz_m.$$
Since $z_i / \sqrt{\sum_{j=1}^m z_j^2}$ is an odd function of $z_i$, $\mathbb{E}[y_i] = 0$, and hence $\mu = \mathbb{E}[Y] = 0$.

The covariance matrix of $Y$ follows from its definition:
$$\Sigma = \mathbb{E}\big[(Y - \mathbb{E}[Y])(Y - \mathbb{E}[Y])^T\big] = \begin{bmatrix} \mathbb{E}[y_1^2] & \mathbb{E}[y_1 y_2] & \cdots & \mathbb{E}[y_1 y_m] \\ \mathbb{E}[y_2 y_1] & \mathbb{E}[y_2^2] & \cdots & \mathbb{E}[y_2 y_m] \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E}[y_m y_1] & \mathbb{E}[y_m y_2] & \cdots & \mathbb{E}[y_m^2] \end{bmatrix}.$$
For $i \neq j$, $\mathbb{E}[y_i y_j]$ is the integral of $\frac{z_i}{\sqrt{\sum_k z_k^2}} \frac{z_j}{\sqrt{\sum_k z_k^2}}$ against $f_Z$; the integrand is odd in $z_i$, so $\mathbb{E}[y_i y_j] = 0$ for all $i \neq j$. For the diagonal elements, symmetry gives $\mathbb{E}[y_1^2] = \mathbb{E}[y_2^2] = \cdots = \mathbb{E}[y_m^2]$, and since
$$m\,\mathbb{E}[y_i^2] = \mathbb{E}\Big[\sum_{i=1}^m y_i^2\Big] = \mathbb{E}\Big[\frac{\sum_{i=1}^m z_i^2}{\sum_{i=1}^m z_i^2}\Big] = 1,$$
we conclude $\mathbb{E}[y_i^2] = \frac{1}{m}$. Therefore $\Sigma = \frac{1}{m} I_m$.

C PROBABILITY DENSITY FUNCTION OF $Y_i$

Theorem 5. For a random variable $Z \sim \mathcal{N}(0, \sigma^2 I_m)$, $Z \in \mathbb{R}^m$, and the $\ell_2$-normalized form $Y = Z/\|Z\|_2$, the probability density function (pdf) of the coordinate $Y_i$ in an arbitrary dimension is:
$$f_{Y_i}(y_i) = \frac{\Gamma(m/2)}{\sqrt{\pi}\,\Gamma((m-1)/2)} (1 - y_i^2)^{(m-3)/2}.$$

Proof. For $Z = [Z_1, Z_2, \cdots, Z_m] \sim \mathcal{N}(0, \sigma^2 I_m)$, we have $Z_i \sim \mathcal{N}(0, \sigma^2)$ for all $i \in [1, m]$.
Denote $U = Z_i/\sigma \sim \mathcal{N}(0, 1)$ and $V = \sum_{j \neq i} (Z_j/\sigma)^2 \sim \chi^2(m-1)$; $U$ and $V$ are independent. The variable $T = U/\sqrt{V/(m-1)}$ obeys Student's t-distribution with $m-1$ degrees of freedom, and its pdf is
$$f_T(t) = \frac{\Gamma(m/2)}{\sqrt{(m-1)\pi}\,\Gamma((m-1)/2)} \Big(1 + \frac{t^2}{m-1}\Big)^{-m/2}.$$
Since
$$Y_i = \frac{Z_i}{\sqrt{\sum_{j=1}^m Z_j^2}} = \frac{Z_i/\sigma}{\sqrt{(Z_i/\sigma)^2 + \sum_{j \neq i} (Z_j/\sigma)^2}} = \frac{U}{\sqrt{U^2 + V}},$$
we have $T = \frac{\sqrt{m-1}\, Y_i}{\sqrt{1 - Y_i^2}}$ and $Y_i = \frac{T}{\sqrt{T^2 + m - 1}}$. The map $t \mapsto t/\sqrt{t^2 + m - 1}$ is strictly increasing, so handling the cases $y_i \le 0$ and $y_i > 0$ separately yields the same expression, and the cumulative distribution functions (cdfs) satisfy
$$F_{Y_i}(y_i) = P(Y_i \le y_i) = P\Big(T \le \frac{\sqrt{m-1}\, y_i}{\sqrt{1 - y_i^2}}\Big) = F_T\Big(\frac{\sqrt{m-1}\, y_i}{\sqrt{1 - y_i^2}}\Big).$$
Therefore, the pdf of $Y_i$ can be derived as
$$f_{Y_i}(y_i) = \frac{d}{dy_i} F_T\Big(\frac{\sqrt{m-1}\, y_i}{\sqrt{1 - y_i^2}}\Big) = f_T\Big(\frac{\sqrt{m-1}\, y_i}{\sqrt{1 - y_i^2}}\Big) \cdot \frac{d}{dy_i}\Big(\frac{\sqrt{m-1}\, y_i}{\sqrt{1 - y_i^2}}\Big)$$
$$= \Big[\frac{\Gamma(m/2)}{\sqrt{(m-1)\pi}\,\Gamma((m-1)/2)} (1 - y_i^2)^{m/2}\Big]\Big[\sqrt{m-1}\,(1 - y_i^2)^{-3/2}\Big] = \frac{\Gamma(m/2)}{\sqrt{\pi}\,\Gamma((m-1)/2)} (1 - y_i^2)^{(m-3)/2}.$$

D PROOF OF THEOREM 2

Proof. For the variable $\hat{Y}_i \sim \mathcal{N}(0, \frac{1}{m})$, its pdf and $k$-th order raw moments can be formulated as:
$$f_{\hat{Y}_i}(y) = \sqrt{\frac{m}{2\pi}} \exp\Big\{-\frac{m y^2}{2}\Big\}, \qquad \mathbb{E}[\hat{Y}_i^k] = \begin{cases} \dfrac{\prod_{j=1}^{k/2} (2j-1)}{m^{k/2}} & k = 2j, \\[4pt] 0 & k = 2j - 1, \end{cases} \qquad j = 1, 2, 3, \ldots$$
According to Theorem 5, the pdf of $Y_i$ is
$$f_{Y_i}(y_i) = \frac{\Gamma(m/2)}{\sqrt{\pi}\,\Gamma((m-1)/2)} (1 - y_i^2)^{(m-3)/2}.$$
For $0 \le y^2 < 1$, the Taylor expansion of $\log(1 - y^2)$ can be written as
$$\log(1 - y^2) = -\sum_{j=1}^{\infty} \frac{y^{2j}}{j}.$$
Then the Kullback-Leibler divergence between $\hat{Y}_i$ and $Y_i$ can be formulated as:
$$D_{KL}(\hat{Y}_i, Y_i) = \int_{-\infty}^{\infty} f_{\hat{Y}_i}(y)\big[\log f_{\hat{Y}_i}(y) - \log f_{Y_i}(y)\big]\, dy$$
$$= \int_{-\infty}^{\infty} f_{\hat{Y}_i}(y)\Big[\log\sqrt{\frac{m}{2\pi}} - \frac{m y^2}{2} - \log\frac{\Gamma(m/2)}{\sqrt{\pi}\,\Gamma((m-1)/2)} - \frac{m-3}{2}\log(1 - y^2)\Big]\, dy$$
$$= \log\sqrt{\frac{m}{2}}\,\frac{\Gamma((m-1)/2)}{\Gamma(m/2)} - \frac{m}{2}\mathbb{E}[\hat{Y}_i^2] + \frac{m-3}{2}\sum_{j=1}^{\infty}\frac{\mathbb{E}[\hat{Y}_i^{2j}]}{j}$$
$$= \log\sqrt{\frac{m}{2}}\,\frac{\Gamma((m-1)/2)}{\Gamma(m/2)} - \frac{1}{2} + \frac{m-3}{2}\Big[\frac{1}{m} + \frac{3}{2m^2} + \frac{5 \cdot 3}{3m^3} + o\Big(\frac{1}{m^3}\Big)\Big].$$
According to the Stirling formula, $\Gamma(x + \alpha) \to \Gamma(x)\, x^{\alpha}$ as $x \to \infty$; therefore
$$\lim_{m\to\infty} \log\sqrt{\frac{m}{2}}\,\frac{\Gamma((m-1)/2)}{\Gamma(m/2)} = \lim_{m\to\infty} \log\sqrt{\frac{m}{2}}\,\frac{\Gamma((m-1)/2)}{\Gamma((m-1)/2)\big(\frac{m-1}{2}\big)^{1/2}} = \lim_{m\to\infty} \log\sqrt{\frac{m}{m-1}} = 0.$$
Then the Kullback-Leibler divergence between $\hat{Y}_i$ and $Y_i$ converges to zero as $m \to \infty$:
$$\lim_{m\to\infty} D_{KL}(\hat{Y}_i, Y_i) = 0 + \lim_{m\to\infty}\Big(-\frac{1}{2} + \frac{m-3}{2}\Big[\frac{1}{m} + \frac{3}{2m^2} + \frac{5 \cdot 3}{3m^3} + o\Big(\frac{1}{m^3}\Big)\Big]\Big) = -\frac{1}{2} + \frac{1}{2} = 0.$$

E EXAMINING THE DESIDERATA FOR TWO UNIFORMITY METRICS

E.1 PROOF FOR $-\mathcal{W}_2$ ON DESIDERATA

The first two properties (Properties 1 and 2) can be easily proved from the definition. We examine the remaining three properties one by one for the proposed uniformity metric $-\mathcal{W}_2$.

Proof. First, we prove that the proposed metric $-\mathcal{W}_2$ satisfies Property 3. Since $D \cup D = \{z_1, z_2, \ldots, z_n, z_1, z_2, \ldots, z_n\}$, its mean vector and covariance matrix can be formulated as:
$$\tilde{\mu} = \frac{1}{2n}\sum_{i=1}^{n} 2\, z_i/\|z_i\| = \mu, \qquad \tilde{\Sigma} = \frac{1}{2n}\sum_{i=1}^{n} 2\,(z_i/\|z_i\| - \tilde{\mu})(z_i/\|z_i\| - \tilde{\mu})^T = \Sigma.$$
Then we have
$$\mathcal{W}_2(D \cup D) \triangleq \sqrt{\|\tilde{\mu}\|_2^2 + 1 + \mathrm{Tr}(\tilde{\Sigma}) - \frac{2}{\sqrt{m}}\mathrm{Tr}(\tilde{\Sigma}^{1/2})} = \mathcal{W}_2(D).$$
Therefore $-\mathcal{W}_2(D \cup D) = -\mathcal{W}_2(D)$, indicating that the proposed metric $-\mathcal{W}_2$ satisfies Property 3.

Next, we prove that the proposed metric $-\mathcal{W}_2$ satisfies Property 4. Given $z_i = [z_{i1}, z_{i2}, \ldots, z_{im}]^T$ and $\hat{z}_i = z_i \oplus z_i = [z_{i1}, \ldots, z_{im}, z_{i1}, \ldots, z_{im}]^T \in \mathbb{R}^{2m}$, the set $D \oplus D$ has mean vector and covariance matrix
$$\tilde{\mu} = \begin{bmatrix} \mu/\sqrt{2} \\ \mu/\sqrt{2} \end{bmatrix}, \qquad \tilde{\Sigma} = \begin{bmatrix} \Sigma/2 & \Sigma/2 \\ \Sigma/2 & \Sigma/2 \end{bmatrix}.$$
Since $\tilde{\Sigma}^{1/2} = \begin{bmatrix} \Sigma^{1/2}/2 & \Sigma^{1/2}/2 \\ \Sigma^{1/2}/2 & \Sigma^{1/2}/2 \end{bmatrix}$, we have $\mathrm{Tr}(\tilde{\Sigma}) = \mathrm{Tr}(\Sigma)$ and $\mathrm{Tr}(\tilde{\Sigma}^{1/2}) = \mathrm{Tr}(\Sigma^{1/2})$. Then
$$\mathcal{W}_2(D \oplus D) \triangleq \sqrt{\|\tilde{\mu}\|_2^2 + 1 + \mathrm{Tr}(\tilde{\Sigma}) - \frac{2}{\sqrt{2m}}\mathrm{Tr}(\tilde{\Sigma}^{1/2})} = \sqrt{\|\mu\|_2^2 + 1 + \mathrm{Tr}(\Sigma) - \frac{2}{\sqrt{2m}}\mathrm{Tr}(\Sigma^{1/2})} > \sqrt{\|\mu\|_2^2 + 1 + \mathrm{Tr}(\Sigma) - \frac{2}{\sqrt{m}}\mathrm{Tr}(\Sigma^{1/2})} = \mathcal{W}_2(D).$$
Therefore $-\mathcal{W}_2(D \oplus D) < -\mathcal{W}_2(D)$, indicating that the proposed metric $-\mathcal{W}_2$ satisfies Property 4.

Finally, we prove that the proposed metric $-\mathcal{W}_2$ satisfies Property 5. Given $z_i = [z_{i1}, z_{i2}, \ldots, z_{im}]^T$ and $\hat{z}_i = z_i \oplus 0_k = [z_{i1}, \ldots, z_{im}, 0, \ldots, 0]^T \in \mathbb{R}^{m+k}$, the set $D \oplus 0_k$ has mean vector and covariance matrix
$$\tilde{\mu} = \begin{bmatrix} \mu \\ 0_k \end{bmatrix}, \qquad \tilde{\Sigma} = \begin{bmatrix} \Sigma & 0_{m \times k} \\ 0_{k \times m} & 0_{k \times k} \end{bmatrix}.$$
Therefore $\mathrm{Tr}(\tilde{\Sigma}) = \mathrm{Tr}(\Sigma)$ and $\mathrm{Tr}(\tilde{\Sigma}^{1/2}) = \mathrm{Tr}(\Sigma^{1/2})$, and
$$\mathcal{W}_2(D \oplus 0_k) \triangleq \sqrt{\|\tilde{\mu}\|_2^2 + 1 + \mathrm{Tr}(\tilde{\Sigma}) - \frac{2}{\sqrt{m+k}}\mathrm{Tr}(\tilde{\Sigma}^{1/2})} = \sqrt{\|\mu\|_2^2 + 1 + \mathrm{Tr}(\Sigma) - \frac{2}{\sqrt{m+k}}\mathrm{Tr}(\Sigma^{1/2})} > \mathcal{W}_2(D).$$
Therefore $-\mathcal{W}_2(D \oplus 0_k) < -\mathcal{W}_2(D)$, indicating that the proposed metric $-\mathcal{W}_2$ satisfies Property 5.

E.2 PROOF FOR $-\mathcal{L}_U$ ON DESIDERATA

The first two properties (Properties 1 and 2) can be easily proved from the definition. We examine the remaining three properties one by one for the existing uniformity metric $-\mathcal{L}_U$.

Proof. First, we prove that the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 3.
According to the definition of $\mathcal{L}_U$ in Equation 2, we have:
$$\mathcal{L}_U(D \cup D) \triangleq \log\frac{1}{2n(2n-1)/2}\Big(4\sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|}\|_2^2} + \sum_{i=1}^{n} e^{-t\|\frac{z_i}{\|z_i\|} - \frac{z_i}{\|z_i\|}\|_2^2}\Big) = \log\frac{1}{2n(2n-1)/2}\Big(4\sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|}\|_2^2} + n\Big).$$
Set $G = \sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|}\|_2^2}$. Then
$$G \le \sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{z_i}{\|z_i\|} - \frac{z_i}{\|z_i\|}\|_2^2} = n(n-1)/2,$$
with $G = n(n-1)/2$ if and only if $z_1 = z_2 = \cdots = z_n$. Therefore,
$$\mathcal{L}_U(D \cup D) - \mathcal{L}_U(D) = \log\frac{4G + n}{2n(2n-1)/2} - \log\frac{G}{n(n-1)/2} = \log\frac{(4G + n)\, n(n-1)/2}{2n(2n-1)/2 \cdot G} = \log\frac{(4G+n)(n-1)}{4nG - 2G} = \log\frac{4nG - 4G + n^2 - n}{4nG - 2G} \ge \log 1 = 0.$$
Equality $\mathcal{L}_U(D \cup D) = \mathcal{L}_U(D)$ holds if and only if $G = n(n-1)/2$, which requires $z_1 = z_2 = \cdots = z_n$ (the extreme case where all representations collapse to a constant point, as depicted in Fig. 1). Excluding this extreme case from consideration in the paper, we have $-\mathcal{L}_U(D \cup D) < -\mathcal{L}_U(D)$. Therefore, the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 3.

Next, we prove that the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 4. Given $z_i = [z_{i1}, \ldots, z_{im}]^T$ and $z_j = [z_{j1}, \ldots, z_{jm}]^T$, set $\hat{z}_i = z_i \oplus z_i$ and $\hat{z}_j = z_j \oplus z_j$. Then
$$\mathcal{L}_U(D \oplus D) \triangleq \log\frac{1}{n(n-1)/2}\sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{\hat{z}_i}{\|\hat{z}_i\|} - \frac{\hat{z}_j}{\|\hat{z}_j\|}\|_2^2}.$$
Since $\hat{z}_i = [z_{i1}, \ldots, z_{im}, z_{i1}, \ldots, z_{im}]^T$ and $\hat{z}_j = [z_{j1}, \ldots, z_{jm}, z_{j1}, \ldots, z_{jm}]^T$, we have $\|\hat{z}_i\| = \sqrt{2}\|z_i\|$, $\|\hat{z}_j\| = \sqrt{2}\|z_j\|$, and $\langle\hat{z}_i, \hat{z}_j\rangle = 2\langle z_i, z_j\rangle$, so
$$\Big\|\frac{\hat{z}_i}{\|\hat{z}_i\|} - \frac{\hat{z}_j}{\|\hat{z}_j\|}\Big\|_2^2 = 2 - 2\frac{\langle\hat{z}_i, \hat{z}_j\rangle}{\|\hat{z}_i\|\|\hat{z}_j\|} = 2 - 2\frac{2\langle z_i, z_j\rangle}{\sqrt{2}\|z_i\| \cdot \sqrt{2}\|z_j\|} = \Big\|\frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|}\Big\|_2^2.$$
Therefore $-\mathcal{L}_U(D \oplus D) = -\mathcal{L}_U(D)$, indicating that the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 4.

Finally, we prove that the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 5.
Given $z_i = [z_{i1}, \ldots, z_{im}]^T$ and $z_j = [z_{j1}, \ldots, z_{jm}]^T$, set $\hat{z}_i = z_i \oplus 0_k$ and $\hat{z}_j = z_j \oplus 0_k$. Then
$$\mathcal{L}_U(D \oplus 0_k) \triangleq \log\frac{1}{n(n-1)/2}\sum_{i=2}^{n}\sum_{j=1}^{i-1} e^{-t\|\frac{\hat{z}_i}{\|\hat{z}_i\|} - \frac{\hat{z}_j}{\|\hat{z}_j\|}\|_2^2}.$$
Since $\hat{z}_i = [z_{i1}, \ldots, z_{im}, 0, \ldots, 0]^T$ and $\hat{z}_j = [z_{j1}, \ldots, z_{jm}, 0, \ldots, 0]^T$, we have $\|\hat{z}_i\| = \|z_i\|$, $\|\hat{z}_j\| = \|z_j\|$, and $\langle\hat{z}_i, \hat{z}_j\rangle = \langle z_i, z_j\rangle$, so
$$\Big\|\frac{\hat{z}_i}{\|\hat{z}_i\|} - \frac{\hat{z}_j}{\|\hat{z}_j\|}\Big\|_2^2 = 2 - 2\frac{\langle\hat{z}_i, \hat{z}_j\rangle}{\|\hat{z}_i\|\|\hat{z}_j\|} = 2 - 2\frac{\langle z_i, z_j\rangle}{\|z_i\|\|z_j\|} = \Big\|\frac{z_i}{\|z_i\|} - \frac{z_j}{\|z_j\|}\Big\|_2^2.$$
Therefore $-\mathcal{L}_U(D \oplus 0_k) = -\mathcal{L}_U(D)$, indicating that the baseline metric $-\mathcal{L}_U$ cannot satisfy Property 5.

G WASSERSTEIN DISTANCE

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big],$$
where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $\mathbb{P}_r$ and $\mathbb{P}_g$. Intuitively, when viewing each distribution as a unit amount of earth/soil, the Wasserstein distance, or Earth-Mover distance, is the minimum cost of transporting "mass" from $x$ to $y$ in order to transform the distribution $\mathbb{P}_r$ into the distribution $\mathbb{P}_g$.
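The desiderata proofs in App. E can also be checked numerically. The sketch below is our NumPy transcription of the two metrics (W2 against N(0, I/m), and L_U with t = 2); the PSD matrix square root via eigendecomposition and the toy data are our own implementation choices, not code from the paper:

```python
import numpy as np

def _sqrtm_psd(S):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def W2(D):
    """Proposed metric: quadratic Wasserstein distance between the
    l2-normalized rows of D and N(0, I_m / m)."""
    Y = D / np.linalg.norm(D, axis=1, keepdims=True)
    mu, Sigma = Y.mean(axis=0), np.cov(Y, rowvar=False, bias=True)
    m = D.shape[1]
    val = mu @ mu + 1 + np.trace(Sigma) - 2 / np.sqrt(m) * np.trace(_sqrtm_psd(Sigma))
    return np.sqrt(max(val, 0.0))

def L_U(D, t=2.0):
    """Baseline metric of Wang & Isola (2020): log of the mean pairwise
    Gaussian potential between l2-normalized vectors."""
    Y = D / np.linalg.norm(D, axis=1, keepdims=True)
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(Y), k=1)
    return np.log(np.mean(np.exp(-t * sq[iu])))

rng = np.random.default_rng(0)
D = rng.normal(size=(256, 16))
cat = np.hstack([D, D])                     # D ⊕ D  (feature cloning)
pad = np.hstack([D, np.zeros((256, 16))])   # D ⊕ 0_k (zero padding)

w_base, w_cat, w_pad = W2(D), W2(cat), W2(pad)
w_dup = W2(np.vstack([D, D]))               # D ∪ D  (sample duplication)
l_base, l_cat, l_pad = L_U(D), L_U(cat), L_U(pad)
```

As the proofs predict, `w_dup` coincides with `w_base` while `w_cat` and `w_pad` are strictly larger, whereas `l_cat` and `l_pad` coincide with `l_base`: L_U is blind to exactly the dimensional collapse that Properties 4 and 5 penalize.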

H DETAILS ON BINNING DENSITY

Details for 1D Visualization The densities of $Y_i$ and $\hat{Y}_i$ visualized in Fig. 2 are estimated by binning 200,000 data samples into 51 groups. We observe that the density of $Y_i$ overlaps more and more with that of $\hat{Y}_i$. To further verify this observation, we instantiate $\mathbb{P}_r$ and $\mathbb{P}_g$ in Equation 12 with the binning densities of $Y_i$ and $\hat{Y}_i$, and employ $W_1(\mathbb{P}_r, \mathbb{P}_g)$ as the distribution distance between $Y_i$ and $\hat{Y}_i$. We calculate $W_1(\mathbb{P}_r, \mathbb{P}_g)$ ten times and report the average, as visualized in Fig. 3.
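In one dimension, the binned $W_1$ distance reduces to the area between the two empirical CDFs. A sketch of this computation (bin range, sample sizes, and the second "far" comparison are our own illustrative choices):

```python
import numpy as np

def binned_w1(samples_a, samples_b, bins=51, lo=-1.0, hi=1.0):
    """W1 distance between two binned 1-D densities: the area between
    their empirical CDFs on a shared grid."""
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(samples_a, bins=edges)
    pb, _ = np.histogram(samples_b, bins=edges)
    cdf_a = np.cumsum(pa / pa.sum())
    cdf_b = np.cumsum(pb / pb.sum())
    return np.sum(np.abs(cdf_a - cdf_b)) * (edges[1] - edges[0])

rng = np.random.default_rng(0)
m, n = 32, 200_000
Z = rng.normal(size=(n, m))
Y_i = Z[:, 0] / np.linalg.norm(Z, axis=1)           # a coordinate of Y
Yhat_i = rng.normal(scale=1 / np.sqrt(m), size=n)   # samples from N(0, 1/m)

d_close = binned_w1(Y_i, Yhat_i)                    # small: distributions overlap
d_far = binned_w1(Y_i, rng.normal(scale=0.5, size=n))
```

For $m = 32$, `d_close` is near zero while the deliberately mismatched `d_far` is much larger, mirroring the trend in Fig. 3.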

Details for 2D Visualization

The joint densities of $(Y_i, Y_j)$ and $(\hat{Y}_i, \hat{Y}_j)$ ($i \neq j$) visualized in Fig. 8 are estimated by binning 2,000,000 data samples into $51 \times 51$ groups along the two axes ($m = 32$).
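The 2D estimate follows the same binning recipe via `np.histogram2d`; a minimal sketch (with our own, smaller sample size and seed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 32, 200_000
Z = rng.normal(size=(n, m))
Y = Z / np.linalg.norm(Z, axis=1, keepdims=True)
Yhat = rng.normal(scale=1 / np.sqrt(m), size=(n, 2))  # (Ŷi, Ŷj) ~ N(0, I/m)

edges = np.linspace(-1, 1, 52)                        # 51 x 51 bins on [-1, 1]^2
H, _, _ = np.histogram2d(Y[:, 0], Y[:, 1], bins=[edges, edges], density=True)
Hhat, _, _ = np.histogram2d(Yhat[:, 0], Yhat[:, 1], bins=[edges, edges], density=True)
```

For moderate $m$ the two binned joint densities nearly coincide near the origin; note that the corners of the square are empty for $Y$, since $y_i^2 + y_j^2 \le 1$ always holds on the sphere.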

I OTHER DISTRIBUTION DISTANCES OVER GAUSSIAN DISTRIBUTION

In this section, besides the Wasserstein distance over Gaussian distributions (Theorem 3), we discuss using other distribution distances as uniformity metrics and compare them with the Wasserstein distance. As shown in Theorem 6 and Theorem 7, computing the Kullback-Leibler divergence and the Bhattacharyya distance over Gaussian distributions requires the covariance matrix to be full rank, which makes them hard to use for dimensional collapse analysis. In contrast, our proposed uniformity metric via the Wasserstein distance is free from this requirement on the covariance matrix, making it easier to apply in practical scenarios.

Theorem 6. (Kullback-Leibler Divergence; Lindley & Kullback (1959)) Suppose two random variables $Z_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $Z_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ obey multivariate normal distributions. Then the Kullback-Leibler divergence between $Z_1$ and $Z_2$ is:
$$D_{KL}(Z_1, Z_2) = \frac{1}{2}\Big((\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \mathrm{Tr}(\Sigma_2^{-1}\Sigma_1 - I) + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\Big).$$

Theorem 7. (Bhattacharyya Distance; Bhattacharyya (1943)) Suppose two random variables $Z_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $Z_2 \sim \mathcal{N}(\mu_2, \Sigma_2)$ obey multivariate normal distributions, and let $\Sigma = \frac{1}{2}(\Sigma_1 + \Sigma_2)$. Then the Bhattacharyya distance between $Z_1$ and $Z_2$ is:
$$D_B(Z_1, Z_2) = \frac{1}{8}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_1 \det\Sigma_2}}.$$
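To make the full-rank requirement concrete, the sketch below (our own toy example, not from the paper) evaluates the closed-form Gaussian KL divergence of Theorem 6 and shows it diverging as soon as one covariance eigenvalue collapses to zero:

```python
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    """KL divergence between N(mu1, S1) and N(mu2, S2); the log-determinant
    term requires S1 (and S2) to be full rank."""
    m = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu1 - mu2
    with np.errstate(divide="ignore"):  # log(0) -> -inf when S1 is singular
        log_det_ratio = np.log(np.linalg.det(S2)) - np.log(np.linalg.det(S1))
    return 0.5 * (d @ S2_inv @ d + np.trace(S2_inv @ S1) - m + log_det_ratio)

m = 4
mu, S_full = np.zeros(m), np.eye(m) / m
kl_same = gaussian_kl(mu, S_full, mu, S_full)      # identical Gaussians: 0

# Under dimensional collapse one covariance eigenvalue is exactly zero,
# so det(S1) = 0 and the KL divergence blows up to infinity.
S_collapsed = np.diag([1 / m, 1 / m, 1 / m, 0.0])
kl_collapsed = gaussian_kl(mu, S_collapsed, mu, S_full)
```

The W2-based metric has no such failure mode, since it only needs traces of $\Sigma$ and $\Sigma^{1/2}$, both well defined for singular covariances.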

J EXPERIMENTAL SETTINGS

Setting To make a fair comparison, we conduct all experiments in Sec. 5 on a single 1080 GPU. We also adopt the same network architecture for all models, i.e., ResNet-18 (He et al., 2016) as the encoder, a three-layer MLP as the projector, and a three-layer MLP as the predictor. We use the LARS optimizer (You et al., 2017) with a base learning rate of 0.2, along with a cosine decay learning rate schedule (Loshchilov & Hutter, 2017), for all models. We evaluate all models under a linear evaluation protocol: models are pre-trained for 500 epochs and evaluated by adding a linear classifier and training the classifier for 100 epochs while keeping the learned representations frozen. We also deploy the same augmentation strategy for all models, namely the composition of a series of data augmentation operations such as color distortion, rotation, and cutout. Following (da Costa et al., 2022), we set the temperature $t = 0.2$ for all contrastive methods. For MoCo (He et al., 2020) and NNCLR (Dwibedi et al., 2021), which require an extra queue to store negative samples, we set the queue size to $2^{12}$. For the linear decay used to weight the Wasserstein distance, detailed parameter settings are shown in Table 2.



We also discuss using other distribution distances as uniformity metrics, such as the Kullback-Leibler divergence and the Bhattacharyya distance over Gaussian distributions; see App. I for more details.

Footnote 1: Let $W$ be an orthogonal transformation such that $W r_1 = r_2$. $D_2$ can be obtained by transforming every point of $D_1$ using this orthogonal transformation, namely $D_2 = \{W r : r \in D_1\}$.



Figure 1: The left figure presents constant collapse, and the right figure visualizes dimensional collapse.

Figure 2: The binning densities of Yi and Ŷi over various dimensions. See the 2D visualization in App. F.

Figure 3: Distance between Yi and Ŷi

Figure 4: Uniformity analysis on distributions via two metrics.

Figure 5: Analysis on dimensional collapse degrees. W2 is more sensitive to collapse degrees than LU .

Figure 8: Visualization of two arbitrary dimensions of Y and Ŷ when m = 32. See the corresponding one-dimensional binning densities over various dimensions in Fig. 2.

Table 1: Main comparison on the CIFAR-10 and CIFAR-100 datasets. Proj. and Pred. are the hidden dimensions of the projector and predictor. ↑ and ↓ denote gains and losses, respectively.

Table 2: Parameter settings for various models in the experiments.

K ALIGNMENT METRIC FOR SELF-SUPERVISED REPRESENTATION LEARNING

As one of the important indicators of representation capacity, the alignment metric measures the distance between semantically similar samples in the representation space; smaller alignment generally indicates better representation capacity. Wang & Isola (2020) propose a simple approach that computes the average distance between positive pairs as the alignment:
$$\mathcal{A}(D) = \frac{1}{n}\sum_{i=1}^{n}\Big\|\frac{z_i^a}{\|z_i^a\|} - \frac{z_i^b}{\|z_i^b\|}\Big\|_2^{\beta},$$
where $(z_i^a, z_i^b)$ is a positive pair as discussed in Sec. 2.1. We set $\beta = 2$ in the experiments.
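The alignment metric amounts to a few lines of NumPy. In this sketch the positive pairs are simulated by a small additive perturbation; the perturbation scale and data sizes are our own illustrative choices, not the paper's augmentation pipeline:

```python
import numpy as np

def alignment(z_a, z_b, beta=2):
    """Average distance between l2-normalized positive pairs
    (Wang & Isola, 2020); beta = 2 in our experiments."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    return np.mean(np.linalg.norm(z_a - z_b, axis=1) ** beta)

rng = np.random.default_rng(0)
z = rng.normal(size=(1024, 64))
z_aug = z + 0.05 * rng.normal(size=z.shape)      # a nearby "augmented" view

a_pos = alignment(z, z_aug)                      # small: views nearly coincide
a_rand = alignment(z, rng.normal(size=z.shape))  # large: unrelated pairs
```

Well-aligned representations make `a_pos` close to zero, while unrelated pairs on the hypersphere average a squared distance of about 2.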

L CONVERGENCE ANALYSIS ON TOP-1 ACCURACY

Here we show the change of Top-1 accuracy over all training epochs in Fig. 9. During training, we take the model checkpoint after each epoch, train a linear classifier on it, and then evaluate the Top-1 accuracy on the unseen images of the test set (in either CIFAR-10 or CIFAR-100). In both CIFAR-10 and CIFAR-100, we observe that imposing the proposed uniformity metric as an auxiliary penalty loss largely improves the Top-1 accuracy, especially in the early stage.

M ANALYSIS ON UNIFORMITY AND ALIGNMENT

Here we show the change of uniformity and alignment over all training epochs in Fig. 10 and Fig. 11, respectively. During training, we take the model checkpoint after each epoch to evaluate the uniformity (using the proposed metric W2) and the alignment (Wang & Isola, 2020) on the unseen images of the test set (in either CIFAR-10 or CIFAR-100). In both CIFAR-10 and CIFAR-100, we observe that imposing the proposed uniformity metric as an auxiliary penalty loss largely improves uniformity. Consequently, it also slightly degrades alignment (the smaller, the better aligned), since better uniformity usually comes at the cost of worse alignment by definition.

N THE EXPLANATION FOR PROPERTY 5

Here we explain, via a case study, why Property 5 is an inequality instead of an equality. Suppose a set of data vectors $D$, as defined in Sec. 4.1, attains maximum uniformity. When additional zero-valued dimensions are appended to $D$, the new set of data vectors $D \oplus 0_k$ can no longer achieve maximum uniformity, as the vectors occupy only a small portion of the surface of the unit hypersphere; the uniformity therefore decreases significantly for large $k$. To further illustrate the inequality, we visualize sampled data vectors. In Fig. 12(a), we visualize 400 data vectors ($D_1$) sampled from $\mathcal{N}(0, I_2)$, which distribute almost uniformly on $S^1$. We append one zero-valued dimension to $D_1$ and denote the result by $D_1 \oplus 0_1$, as shown in Fig. 12(b). In comparison with $D_2$, where 400 data vectors sampled from $\mathcal{N}(0, I_3)$ distribute almost uniformly on $S^2$ (Fig. 12(c)), $D_1 \oplus 0_1$ occupies only a ring on $S^2$. Therefore, $\mathcal{U}(D_2) > \mathcal{U}(D_1 \oplus 0_1)$. Note that regardless of the dimension $m$, the baseline uniformity metric (Wang & Isola, 2020) and our proposed uniformity metric have the same maximum uniformity, i.e., $\mathcal{W}_2 = 0$ and $\mathcal{L}_U = -4.0$. Therefore, the maximum uniformity over various dimensions $m$ should be equal, or at least close, and we have $\mathcal{U}(D_1) \approx \mathcal{U}(D_2) > \mathcal{U}(D_1 \oplus 0_1)$. Property 5 should thus be an inequality, and it can be used to identify a metric's capacity to capture sensitivity to dimensional collapse.
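The case study above can be replayed numerically with the proposed metric: padding $D_1$ with a zero dimension increases $\mathcal{W}_2$ markedly, while genuinely three-dimensional samples $D_2$ stay close to maximum uniformity. A sketch with our own seed and the same sample size of 400 (the eigendecomposition-based matrix square root is an implementation choice):

```python
import numpy as np

def _sqrtm_psd(S):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def W2(D):
    """Quadratic Wasserstein distance between the l2-normalized rows of D
    and N(0, I_m / m); 0 corresponds to maximum uniformity."""
    Y = D / np.linalg.norm(D, axis=1, keepdims=True)
    mu, Sigma = Y.mean(axis=0), np.cov(Y, rowvar=False, bias=True)
    m = D.shape[1]
    val = mu @ mu + 1 + np.trace(Sigma) - 2 / np.sqrt(m) * np.trace(_sqrtm_psd(Sigma))
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
D1 = rng.normal(size=(400, 2))                 # ~uniform on S^1 after normalization
D2 = rng.normal(size=(400, 3))                 # ~uniform on S^2 after normalization
D1_pad = np.hstack([D1, np.zeros((400, 1))])   # D1 ⊕ 0_1: only a ring on S^2

w1, w2, w_pad = W2(D1), W2(D2), W2(D1_pad)
```

Here `w1` and `w2` are both close to zero (near-maximum uniformity in their respective dimensions), while `w_pad` is clearly larger, which is exactly the strict inequality that Property 5 demands.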

