WHAT DO WE MAXIMIZE IN SELF-SUPERVISED LEARNING AND WHY DOES GENERALIZATION EMERGE?

Abstract

In this paper, we provide an information-theoretic (IT) understanding of self-supervised learning (SSL) methods, their construction, and optimality. First, we demonstrate how IT quantities can be obtained for deterministic networks as an alternative to the commonly used but unrealistic assumption of stochastic networks. Second, we demonstrate how different SSL models can be (re)discovered from first principles and highlight the underlying assumptions of different SSL variants. Based on this understanding, we present new SSL methods that are superior to existing methods in terms of performance. Third, we derive a novel generalization bound based on our IT understanding of SSL methods, providing generalization guarantees for the downstream supervised learning task. As a result of this bound, along with our unified view of SSL, we can compare the different approaches and provide general guidelines to practitioners. Consequently, our derivation and insights contribute to a better understanding of SSL and transfer learning from a theoretical and practical perspective.

1. INTRODUCTION

Self-Supervised Learning (SSL) methods learn representations using a surrogate objective between inputs and self-defined signals. In SimCLR (Chen et al., 2020), for example, a contrastive loss is defined that makes representations of different versions of the same image similar, while making the representations of different images different. After optimizing the surrogate objective, the pre-trained model is used as a feature extractor for a downstream supervised task, such as image classification, object detection, instance segmentation, and transfer learning (Caron et al., 2021; Chen et al., 2020; Misra & Maaten, 2020; Shwartz-Ziv et al., 2022). However, despite this success in practice, only a few authors (Arora et al., 2019; Lee et al., 2021a) have sought to provide theoretical insights into the effectiveness of SSL.
In recent years, information-theoretic methods have played a key role in several deep learning achievements, from practical applications in representation learning (Alemi et al., 2016) to theoretical investigations (Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv, 2022). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities. In particular, many works have attempted to analyze SSL from an information-theoretic perspective. An example is the use of the renowned information maximization (InfoMax) principle (Linsker, 1988) in SSL (Bachman et al., 2019). However, these works can be confusing when read together: numerous objective functions are presented without rigorous justification, some contradict each other, and many rely on implicit assumptions (Kahana & Hoshen, 2022; Wang et al., 2022; Lee et al., 2021b). Moreover, these works rely on a crucial assumption, a stochastic deep network (DN) mapping, which is rarely the case nowadays.
This paper presents a unified framework for SSL methods from an information-theoretic perspective, which can be applied to deterministic DN training. We summarize our contributions in four points: (i) In order to study deterministic DNs from an information-theoretic perspective, we shift stochasticity to the DN input, which is a much more faithful assumption for current training techniques. (ii) Based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities. (iii) We present new SSL methods based on our analysis and empirically validate their superior performance. (iv) We study how the optimization of information-theoretic quantities is related to the final performance in the downstream task using a new generalization bound.

2. BACKGROUND

Continuous Piecewise Affine (CPA) Mappings. A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition $\Omega$ of a domain $\mathbb{R}^D$, a spline of order $k$ is a mapping defined by a polynomial of order $k$ on each region $\omega \in \Omega$, with continuity constraints on the entire domain for the derivatives of order $0, \dots, k-1$. As we will focus on affine splines ($k = 1$), we define only this case for concreteness. A $K$-dimensional affine spline $f$ produces its output via
$$f(z) = \sum_{\omega \in \Omega} (A_\omega z + b_\omega)\,\mathbb{1}_{\{z \in \omega\}}, \tag{1}$$
with input $z \in \mathbb{R}^D$ and $A_\omega \in \mathbb{R}^{K \times D}$, $b_\omega \in \mathbb{R}^K$, $\forall \omega \in \Omega$, the per-region slope and offset parameters respectively, and with the key constraint that the entire mapping is continuous over the domain, $f \in C^0(\mathbb{R}^D)$. Spline operators, and especially affine spline operators, have been widely used in function approximation theory (Cheney & Light, 2009), optimal control (Egerstedt & Martin, 2009), statistics (Fantuzzi et al., 2002), and related fields.
Deep Networks. A deep network (DN) is a (non-linear) operator $f_\Theta$ with parameters $\Theta$ that maps an input $x \in \mathbb{R}^D$ to a prediction $y \in \mathbb{R}^K$. The precise definitions of DN operators can be found in Goodfellow et al. (2016). We will omit the $\Theta$ notation for clarity unless needed. The only assumption we require for our study is that the non-linearities present in the DN are CPA, as is the case with (leaky-)ReLU, absolute value, and max-pooling. In that case, the entire input-output mapping becomes a CPA spline with an implicit partition $\Omega$ that is a function of the weights and architecture of the network (Montufar et al., 2014; Balestriero & Baraniuk, 2018). For smooth nonlinearities, our results hold from a first-order Taylor approximation argument.
Self-Supervised Learning. Joint embedding methods learn the DN parameters $\Theta$ without supervision or input reconstruction.
The difficulty of SSL is to produce a good representation for downstream tasks whose labels are not available during training, while avoiding a trivially simple solution where the model maps all inputs to a constant output. Many methods have been proposed to solve this problem; see Balestriero & LeCun (2022) for a summary and connections between methods. Contrastive methods learn representations by contrasting positive and negative examples, e.g., SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018). Other recent work introduced non-contrastive methods that employ different regularization methods to prevent collapse of the representation. Several papers used stop-gradients and extra predictors to avoid collapse (Chen & He, 2021; Grill et al., 2020), while Caron et al. (2020) use an additional clustering step. As opposed to contrastive methods, non-contrastive methods do not explicitly rely on negative samples. Of particular interest to us is the VICReg method (Bardes et al., 2021), which considers two embedding batches $Z = [f(x_1), \dots, f(x_N)]$ and $Z' = [f(x'_1), \dots, f(x'_N)]$, each of size $(N \times K)$. Denoting by $C$ the $(K \times K)$ covariance matrix obtained from $[Z, Z']$, we obtain the VICReg triplet loss
$$L = \frac{1}{K}\sum_{k=1}^{K}\Big(\alpha \max\big(0, \gamma - \sqrt{C_{k,k} + \epsilon}\big) + \beta \sum_{k' \neq k}(C_{k,k'})^2\Big) + \gamma \|Z - Z'\|_F^2 / N.$$
Deep Networks and Information Theory. Recently, information-theoretic methods have played a key role in several remarkable deep learning achievements (Alemi et al., 2016; Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv & Tishby, 2017). Moreover, different deep learning problems have been successfully approached by developing and applying information-theoretic estimators and learning principles (Hjelm et al., 2018; Belghazi et al., 2018; Piran et al., 2020; Shwartz-Ziv et al., 2018).
There is, however, a major problem when it comes to analyzing information-theoretic objectives in deterministic deep neural networks: the source of randomness. The mutual information between the input and the representation in such networks is either infinite, resulting in ill-posed optimization problems, or piecewise constant, making gradient-based optimization methods ineffective (Amjad & Geiger, 2019). To solve these problems, researchers have proposed several solutions. For SSL, stochastic deep networks with variational bounds could be used, where the output of the deterministic network is used as the parameters of the conditional distribution (Lee et al., 2021b; Shwartz-Ziv & Alemi, 2020). Dubois et al. (2021) suggested another option, which assumes that the randomness of data augmentation between the two views is the source of stochasticity in the network. Another line of work assumes a random input but does not use any properties of the distribution of the network's output to analyze the network's objective, relying instead on general lower bounds (Wang & Isola, 2020; Zimmermann et al., 2021). For supervised learning, Goldfeld et al. (2018) introduced an auxiliary (noisy) DN framework by injecting additive noise into the model and demonstrated that it is a good proxy for the original (deterministic) DN in terms of both performance and representation. Finally, Achille & Soatto (2018) found that minimizing a stochastic network with a regularizer is equivalent to minimizing cross-entropy over deterministic DNs with multiplicative noise. However, all of these methods assume that the noise comes from the model itself, which contradicts current training methods. In this work, we explicitly assume that the stochasticity comes from the data, which is a less restrictive assumption and does not require changing current algorithms.
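The VICReg loss described earlier in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the reference implementation: the function names, the weights `alpha`, `beta`, `inv_weight`, the threshold `target_std`, and `eps` are hypothetical defaults chosen for the example, and the covariance is computed on the stacked batch $[Z, Z']$ as the text describes. The two roles of $\gamma$ in the formula (hinge threshold and invariance weight) are given separate names here for readability.

```python
import math


def covariance(rows):
    """Unbiased sample covariance (K x K) of a list of K-dimensional rows."""
    n, k = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(k)]
    return [
        [sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / (n - 1)
         for b in range(k)]
        for a in range(k)
    ]


def vicreg_loss(Z, Zp, alpha=25.0, beta=1.0, inv_weight=25.0,
                target_std=1.0, eps=1e-4):
    """Variance + covariance + invariance terms (all weights are assumptions)."""
    n, k = len(Z), len(Z[0])
    C = covariance(Z + Zp)  # covariance of the stacked batch [Z, Z']
    # Hinge on the per-dimension standard deviation (variance term).
    variance = sum(max(0.0, target_std - math.sqrt(C[j][j] + eps))
                   for j in range(k))
    # Squared off-diagonal covariance entries (covariance term).
    covar = sum(C[a][b] ** 2 for a in range(k) for b in range(k) if a != b)
    # Squared Frobenius distance between the two views, divided by N.
    invariance = sum((u - v) ** 2
                     for r, rp in zip(Z, Zp) for u, v in zip(r, rp)) / n
    return (alpha * variance + beta * covar) / k + inv_weight * invariance


# A collapsed representation (every embedding identical) versus a spread one.
collapsed = [[0.5, 0.5]] * 4
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_collapsed = vicreg_loss(collapsed, collapsed)
loss_spread = vicreg_loss(spread, spread)
```

In this toy setting the collapsed batch has zero invariance cost but is penalized by the variance hinge, whereas the decorrelated, unit-scale batch incurs no penalty from any of the three terms.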

3. INFORMATION THEORY FOR DETERMINISTIC DEEP NETWORKS

This section first sets up the notation and assumptions: the information-theoretic challenges in SSL (section 3.1) and our assumptions regarding the data distribution (section 3.2), under which any training sample $x$ can be seen as coming from a single Gaussian distribution, $x \sim \mathcal{N}(\mu_x, \Sigma_x)$. From this, we obtain that the output of any deep network $f(x)$ corresponds to a mixture of truncated Gaussians (section 3.3). In particular, it can fall back to a single Gaussian under small-noise ($\det(\Sigma) \to \epsilon$) assumptions. These results enable information measures to be applied to deterministic DNs. We then recover known SSL methods (Bardes et al., 2021; Chen et al., 2020) by making different assumptions about the data distribution and estimating their information.

3.1. SSL AS AN INFORMATION-THEORETIC PROBLEM

To better grasp the difference between key SSL methods, we first formulate the general SSL goal from an information-theoretic perspective. We start with the MultiView InfoMax principle, i.e., maximizing the mutual information between the representations. Let $X$ and $X'$ be two different views and $Z$ and $Z'$ their corresponding representations. As shown in Federici et al. (2020), to maximize their information, we maximize $I(Z; X')$ and $I(Z'; X)$ using the lower bound
$$I(Z; X') = H(Z) - H(Z|X') \geq H(Z) + \mathbb{E}_{x'}[\log q(z|x')], \tag{2}$$
where $H(Z)$ is the entropy of $Z$. In supervised learning, where we need to maximize $I(Z; Y)$, the labels ($Y$) are fixed, the entropy term $H(Y)$ is constant, and one only needs to optimize the log-loss $\mathbb{E}_{x'}[\log q(z|x)]$ (cross-entropy or square loss). However, it is well known that for Siamese networks there exists a degenerate solution in which all outputs "collapse" into an undesired value (Chen et al., 2020). Looking at eq. (2), we can see that the entropies are not constant and can be optimized throughout the learning process. Therefore, minimizing only the log loss will cause the representation to collapse to the trivial solution of a constant output (where the entropy goes to zero). To regularize these entropies, that is, to prevent collapse, different methods use different approaches to implicitly regularize information. To recover them in section 4, we must first introduce the results on the data distribution (section 3.2) and on how a DN transforms that distribution (section 3.3).

3.2. DATA DISTRIBUTION HYPOTHESIS

Our first step is to assess how the output random variables of the network are represented, assuming a distribution on the data itself. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix in the direction of the manifold tangent space of the data (Fefferman et al., 2016). Therefore, we will consider throughout this study the conditioning of a latent representation with respect to the mean of the observation, i.e., $X|x^* \sim \mathcal{N}(x^*, \Sigma_{x^*})$, where the eigenvectors of $\Sigma_{x^*}$ lie in the same linear subspace as the tangent space of the data manifold at $x^*$, which varies with the position of $x^*$ in space. Hence a dataset is considered to be a collection $\{x^*_n, n = 1, \dots, N\}$ and the full data distribution to be a sum of low-rank covariance Gaussian densities, as in
$$X \sim \sum_{n=1}^{N} \mathcal{N}(x^*_n, \Sigma_{x^*_n})\,\mathbb{1}_{\{T=n\}}, \quad T \sim \mathrm{Cat}(N), \tag{3}$$
with $T$ the uniform Categorical random variable. For simplicity, we consider that the effective supports of $\mathcal{N}(x^*_i, \Sigma_{x^*_i})$ and $\mathcal{N}(x^*_j, \Sigma_{x^*_j})$ do not overlap, where the effective support is defined as $\{x \in \mathbb{R}^D : p(x) > \epsilon\}$. This keeps things general, as it is enough to cover the domain of the data manifold overall, without overlap between different Gaussians. Therefore, we have that
$$p(x) \approx \mathcal{N}\big(x; x^*_{n(x)}, \Sigma_{x^*_{n(x)}}\big)/N, \tag{4}$$
where $\mathcal{N}(x; \cdot, \cdot)$ is the Gaussian density at $x$ and $n(x) = \arg\min_n (x - x^*_n)^T \Sigma_{x^*_n} (x - x^*_n)$. This assumption, that a dataset is a mixture of Gaussians with non-overlapping support, will simplify our derivations below and could be extended to the general case if needed.

3.3. DATA DISTRIBUTION AFTER DEEP NETWORK TRANSFORMATION

Consider an affine spline operator $f$ (eq. (1)) that goes from a space of dimension $D$ to a space of dimension $K$ with $K \geq D$. The range of this mapping, which we call its image, is given by
$$\mathrm{Im}(f) \triangleq \{f(x) : x \in \mathbb{R}^D\} = \bigcup_{\omega \in \Omega} \mathrm{Aff}(\omega; A_\omega, b_\omega), \tag{5}$$
with $\mathrm{Aff}(\omega; A_\omega, b_\omega) = \{A_\omega x + b_\omega : x \in \omega\}$ the affine transformation of region $\omega$ by the per-region parameters $A_\omega, b_\omega$, and with $\Omega$ the partition of the input space in which $x$ lives. In practice, the per-region affine mapping can be computed by setting $A_\omega$ to the Jacobian matrix of the network at the corresponding input $x$, and $b_\omega$ to $f(x) - A_\omega x$. Therefore, the DN mapping consists of affine transformations on each input space partition region $\omega \in \Omega$, based on the coordinate change induced by $A_\omega$ and the shift induced by $b_\omega$. When the input space is equipped with a density distribution, this density is transformed by the mapping $f$. In general, finding the density of $f(X)$ is an intractable task. However, given our disjoint support assumption from section 3.2, we can arbitrarily increase the representation power of the density by increasing the number of prototypes $N$. In doing so, the support of each Gaussian is included within the region $\omega$ in which its mean lies, leading to the following result.
Theorem 1. Given the setting of eq. (4), the unconditional DN output density, denoted as $Z$, is approximately a mixture of the affinely transformed distributions $x|x^*_{n(x)}$, e.g., for the Gaussian case,
$$Z \sim \sum_{n=1}^{N} \mathcal{N}\Big(A_{\omega(x^*_n)} x^*_n + b_{\omega(x^*_n)},\; A^T_{\omega(x^*_n)} \Sigma_{x^*_n} A_{\omega(x^*_n)}\Big)\,\mathbb{1}_{\{T=n\}}, \tag{6}$$
where $\omega(x^*_n) = \omega \in \Omega \iff x^*_n \in \omega$ is the partition region in which the prototype $x^*_n$ lives.
Proof. The proof of Theorem 1 is presented in Appendix A.
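The per-region affine parameters and the resulting Gaussian push-forward can be illustrated on a tiny ReLU network. The sketch below is a hypothetical example, not code from the paper: the weights `W1`, `b1`, `W2`, `b2`, the prototype `x_star`, and all function names are made up for illustration. The Jacobian is obtained from the ReLU activation pattern, and the offset is set to $f(x) - A_\omega x$ as described above. Note that with $A_\omega$ stored as a $K \times D$ matrix, as in Section 2, the covariance of $A_\omega x + b_\omega$ is computed as $A_\omega \Sigma A_\omega^T$.

```python
def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


# Hypothetical one-hidden-layer ReLU network: D = 2 -> 3 hidden units -> K = 2.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
b2 = [0.1, -0.2]


def forward(x):
    hidden = [max(0.0, v + c) for v, c in zip(matvec(W1, x), b1)]
    return [v + c for v, c in zip(matvec(W2, hidden), b2)]


def region_affine(x):
    """Return (A, b) of the affine piece of `forward` that contains x."""
    pre = [v + c for v, c in zip(matvec(W1, x), b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]       # activation pattern
    masked_W1 = [[m * w for w in row] for m, row in zip(mask, W1)]
    A = matmul(W2, masked_W1)                         # Jacobian at x
    b = [f - a for f, a in zip(forward(x), matvec(A, x))]
    return A, b


def pushforward_gaussian(x_star, Sigma):
    """Mean and covariance of f(X) for X ~ N(x_star, Sigma) inside one region."""
    A, b = region_affine(x_star)
    mean = [v + c for v, c in zip(matvec(A, x_star), b)]
    cov = matmul(matmul(A, Sigma), transpose(A))      # A Sigma A^T
    return mean, cov


x_star = [1.0, 0.5]
A, b = region_affine(x_star)
mean, cov = pushforward_gaussian(x_star, [[1e-4, 0.0], [0.0, 1e-4]])
x_near = [1.05, 0.52]                                 # same activation pattern
affine_prediction = [v + c for v, c in zip(matvec(A, x_near), b)]
```

For inputs that stay inside the same region, `affine_prediction` coincides with `forward(x_near)`, which is the property the small-noise argument relies on.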

4. INFORMATION OPTIMIZATION AND OPTIMALITY

Next, we will show how SSL algorithms for deterministic networks can be derived. According to Section 3.1, we want to maximize $I(Z; X')$ and $I(Z'; X)$. Although this mutual information is intractable in general, we can obtain a tractable variational estimate using the expected loss. First, when our input noise is small, namely when the effective support of the Gaussian centered at $x$ is contained within the region $\omega$ of the DN's input space partition, we can reduce the conditional output density to a single Gaussian: $(Z'|X' = x_n) \sim \mathcal{N}(\mu(x_n), \Sigma(x_n))$, where $\mu(x_n) = A_{\omega(x_n)} x_n + b_{\omega(x_n)}$ and $\Sigma(x_n) = A^T_{\omega(x_n)} \Sigma_{x_n} A_{\omega(x_n)}$. Second, in order to compute the expected loss, we need to marginalize out the stochasticity in the output of the network. In general, training with squared loss is equivalent to assuming a Gaussian observation model $p(z|z') \sim \mathcal{N}(z', \Sigma_r)$, where $\Sigma_r = I$. To compute the expected loss over samples of $x'$, we need to marginalize out the stochasticity in $Z'$, which means that the conditional decoder is a Gaussian: $(Z|X' = x_n) \sim \mathcal{N}(\mu(x_n), \Sigma_r + \Sigma(x_n))$. However, the expected log loss over samples of $Z$ is hard to compute, and therefore we focus on its lower bound, the expected log loss over samples of $Z'$.
Figure 1: Left: The network output under SSL training is more Gaussian for small input noise. P-value of the normality test for different SSL models trained on CIFAR-10 at different input noise levels. The dashed line represents the point at which the null hypothesis (Gaussian distribution) can be rejected with 99% confidence. Right: Smaller learning rates prevent collapsing. GMM results, with the entropy in black and the data points and GMM centroids in blue and red, respectively, for the corresponding learning rates (panel labels: Learning Rate Data, Learning Rate Centroids).
For simplicity, we set $\Sigma_r = I$, which gives us
$$\mathbb{E}_{x'}[\log q(z|x')] \geq \mathbb{E}_{z'|x'}[\log q(z|z')] = \frac{d}{2}\log 2\pi - \frac{1}{2}(z - \mu(x'))^2 - \frac{1}{2}\mathrm{Tr}\log \Sigma(x'), \tag{7}$$
and now we can take the expectation over $Z$:
$$\mathbb{E}_{z|x}\mathbb{E}_{z'|x'}[\log q(z|z')] = \frac{d}{2}\log 2\pi - \frac{1}{2}(\mu(x) - \mu(x'))^2 - \frac{1}{2}\log\big(|\Sigma(x)| \cdot |\Sigma(x')|\big). \tag{8}$$
Full derivations of eq. (7) and eq. (8) are presented in Appendix B. Combining the above gives us
$$I(Z; X') \geq H(Z) + \mathbb{E}_{x, z|x, x', z'|x'}[\log q(z|z')] \tag{9}$$
$$= H(Z) + \frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{x,x'}\Big[(\mu(x) - \mu(x'))^2 + \log\big(|\Sigma(x)| \cdot |\Sigma(x')|\big)\Big]. \tag{10}$$
To optimize it in practice, we can approximate $p(x, x')$ using the empirical data distribution:
$$L \approx \frac{1}{N}\sum_{i=1}^{N}\Big[H(Z) - \frac{1}{2}(\mu(x_i) - \mu(x'_i))^2 - \frac{1}{2}\log\big(|\Sigma(x_i)| \cdot |\Sigma(x'_i)|\big)\Big]. \tag{11}$$
Next, we will discuss how the estimation of the intractable entropy $H(Z)$ affects our objective.

4.1. DERIVING VICREG FROM FIRST PRINCIPLES

As a result of eq. (11), we can reconstruct VICReg from first principles. Unfortunately, $H(Z)$ cannot be determined explicitly. However, there are several approximations in the literature (Kolchinsky & Tracey, 2017; Huber et al., 2008). For a detailed discussion of the different entropy estimators, see Appendix C. A simpler solution is to approximate the entire mixture by capturing the first two moments of the distribution, which provides an upper bound on the entropy. Note that we are optimizing an upper bound, which means we do not have a formal guarantee, and this could lead to an arbitrary increase in our estimator. In practice, there are cases where we can achieve good results by maximizing a lower bound (Martinez et al., 2021; Nowozin et al., 2016), even though this may cause instability in the training process. Using $\Sigma_Z$ as the covariance matrix of $Z$, we will maximize
$$L \approx \sum_{i=1}^{N} \log\frac{|\Sigma_Z|}{|\Sigma(x_i)| \cdot |\Sigma(x'_i)|} - \frac{1}{2}(\mu(x_i) - \mu(x'_i))^2. \tag{12}$$
Optimizing the log determinant of $\Sigma_Z$ means maximizing the sum of its log eigenvalues. Although it is theoretically possible to differentiate through an eigendecomposition, this leads to numerical instability (Dang et al., 2018). While many works have attempted to address this issue (Giles, 2008; Ionescu et al., 2015), VICReg uses a more straightforward approach. Because the eigenvalues of a diagonal matrix are its diagonal entries, increasing the sum of the log-diagonal terms is equivalent to increasing the sum of the log eigenvalues. One approach is therefore to push the off-diagonal terms of $\Sigma_Z$ to zero. However, VICReg maximizes the sum of the diagonal terms instead of the sum of their logs, which is an upper bound on the latter (since $\log x \leq x - 1$). An exciting research direction is to maximize the eigenvalues of $\Sigma_Z$ using more sophisticated methods, such as a differentiable expression for the eigendecomposition.
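The relationship between the log determinant and its diagonal surrogates can be checked numerically. The following sketch is an illustration under stated assumptions, not part of the original derivation: the function names and the example covariance `S` are hypothetical, and the log determinant is computed with a hand-written Cholesky factorization to keep the example dependency-free. It illustrates Hadamard's inequality, $\log|\Sigma| \leq \sum_k \log \Sigma_{k,k}$, and the further relaxation $\sum_k \log \Sigma_{k,k} \leq \mathrm{Tr}(\Sigma) - K$.

```python
import math


def logdet_spd(S):
    """Log determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))


def log_diagonal_sum(S):
    """Sum of log-variances: the log determinant if S were diagonal."""
    return sum(math.log(S[i][i]) for i in range(len(S)))


def trace_surrogate(S):
    """Sum of variances minus the dimension, using log x <= x - 1."""
    return sum(S[i][i] for i in range(len(S))) - len(S)


# Hypothetical covariance with correlated dimensions.
S = [[2.0, 0.8], [0.8, 1.0]]
exact = logdet_spd(S)             # log(2.0 * 1.0 - 0.8 ** 2) = log(1.36)
diag_bound = log_diagonal_sum(S)  # log(2.0) + log(1.0)
trace_bound = trace_surrogate(S)  # 3.0 - 2

# With zero off-diagonal entries the first bound is tight.
S_diag = [[2.0, 0.0], [0.0, 1.0]]
gap_when_diagonal = log_diagonal_sum(S_diag) - logdet_spd(S_diag)
```

The gap between `exact` and `diag_bound` shrinks as the off-diagonal entries go to zero, which is one way to read the role of a decorrelation penalty in this context.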

4.2.1. VALIDATION OF OUR ASSUMPTIONS

Based on the theory presented in Section 3.3, the conditional output density $p_{z|x=i}$ reduces to a single Gaussian with decreasing input noise. We validated this using a ResNet-18 model trained with SimCLR or VICReg on the CIFAR-10 dataset (Krizhevsky, 2009). From the test dataset, we draw 512 Gaussian samples for each image and analyze whether each sample remains Gaussian in the penultimate layer of the DN. We then employ D'Agostino and Pearson's test (D'Agostino, 1971). Figure 1 (left) shows the p-value as a function of the normalized standard deviation. For small noise, we can reject the hypothesis that the conditional output density of the network is not Gaussian (85% for VICReg). Increasing the input noise causes the network's output to become less Gaussian. Although the results indicate that the output of the network is Gaussian, even in the small-noise regime there is a 15% Type I error rate.
The next step is to confirm our assumption that the model of the data distribution has non-overlapping effective support. We calculate the distribution of pairwise $\ell_2$ distances between images for seven datasets: MNIST, CIFAR10, CIFAR100, Flowers102, Food101, FGVCAircraft. In the figure in Appendix D, we can see that even for raw pixels, the pairwise distances are far from zero, which means that one can place a small Gaussian around each point without overlap. Therefore, the effective support of these datasets is non-overlapping, and our assumption is realistic.

4.2.2. OPTIMIZING THE MUTUAL INFORMATION OBJECTIVE

Implementing eq. (9) in practice requires many "design choices". In section 4.1, we discussed how VICReg uses an approximation of the entropy that is both loose and an upper bound on the true entropy. Next, we suggest combining the VICReg invariance term with different methods for optimizing the entropy.
Estimators. The VICReg objective aims to approximate the log determinant of the empirical covariance matrix by using its diagonal terms. However, this estimator can be problematic (Huber et al., 2008). Instead, we use the LogDet entropy estimator (Zhouyin & Liu, 2021), which provides a tighter upper bound. This estimator is still an upper bound on the entropy, which does not provide any guarantee. To address this problem, we also use a lower bound based on the pairwise distances of the individual Gaussians (Kolchinsky & Tracey, 2017). These proposed methods are compared with recent SSL methods: SimCLR (Chen et al., 2020) and Barlow Twins (Zbontar et al., 2021).
Setup. Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009). We use ResNet-18 (He et al., 2016) as our backbone and linear evaluation to assess the quality of the representation. For full details, see Appendix E.
Results. It can be seen from Table E that the proposed estimators outperform the original VICReg, SimCLR, and Barlow Twins. By estimating the entropy with a more accurate estimator, we can improve the results of VICReg, and the pairwise distance estimator, which is a lower bound, achieves the best results. This aligns with the theory that we want to maximize a lower bound on the true entropy. These results suggest that a careful selection of entropy estimators, guided by our framework, leads to better results.
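To make the idea of a pairwise-distance entropy estimator concrete, the sketch below computes a lower and an upper bound on the entropy of a uniform Gaussian mixture in the style of Kolchinsky & Tracey (2017). It is a simplified illustration rather than the estimator used in the experiments: it assumes isotropic components $\mathcal{N}(\mu_i, \sigma^2 I)$ with a shared variance, which the text does not state, and the function names and example means are hypothetical. Under that assumption, the Bhattacharyya distance between components is $\|\mu_i - \mu_j\|^2 / (8\sigma^2)$ (lower bound) and the KL divergence is $\|\mu_i - \mu_j\|^2 / (2\sigma^2)$ (upper bound).

```python
import math


def component_entropy(dim, sigma2):
    """Entropy of a single isotropic Gaussian N(mu, sigma2 * I)."""
    return 0.5 * dim * math.log(2.0 * math.pi * math.e * sigma2)


def pairwise_entropy_bounds(means, sigma2):
    """(lower, upper) bounds on the entropy of a uniform mixture of
    N(mean_i, sigma2 * I), computed from pairwise distances only."""
    n, dim = len(means), len(means[0])

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def estimate(scale):
        # H(X|C) - (1/n) sum_i log( (1/n) sum_j exp(-D(p_i || p_j)) )
        total = 0.0
        for mi in means:
            inner = sum(math.exp(-sqdist(mi, mj) / (scale * sigma2))
                        for mj in means) / n
            total -= math.log(inner) / n
        return component_entropy(dim, sigma2) + total

    lower = estimate(8.0)  # Bhattacharyya (Chernoff, alpha = 1/2) distance
    upper = estimate(2.0)  # KL divergence
    return lower, upper


means = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
lower, upper = pairwise_entropy_bounds(means, sigma2=1.0)
```

When all means coincide, both bounds reduce to the single-component entropy; when the components are far apart, both approach that entropy plus $\ln N$, so spreading the embeddings apart increases either estimate.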

5. SELF SUPERVISED LEARNING, EM AND INFORMATION

Several SSL methods employ the stop-gradient operator and train only with positive pairs of data (Grill et al., 2020; Chen & He, 2021). According to Chen & He (2021), the presence of the stop-gradient operation implicitly introduces two sets of variables, and the algorithm alternates between optimizing each set. Next, we formalize these SSL methods as generalized EM optimization problems, link them to information theory, and analyze how specific design choices affect their collapse.

5.1. THE EM ALGORITHM AND SELF SUPERVISED LEARNING

The classical approach to learning with hidden variables is based on the Expectation Maximization (EM) algorithm (Dempster et al., 1977). Neal & Hinton (1998) showed that we can view it as a dual optimization in which both steps maximize the same function,
$$F(\tilde{P}, \theta) = \mathbb{E}_{\tilde{P}}[P(Z, Z'|\theta)] + H(\tilde{P}),$$
where $H(\tilde{P}) = -\mathbb{E}_{\tilde{P}}[\log \tilde{P}(z')]$ is the entropy of the empirical distribution $\tilde{P}$ and $\mathbb{E}_{\tilde{P}}[P(Z, Z'|\theta)]$ is the regular likelihood. Using this formulation, Neal & Hinton (1998) showed that the (G)EM algorithm maximizes a variational lower bound on the log likelihood. However, as discussed in section 3.1, optimizing the likelihood can be problematic when both variables are changing. Unlike in the classic EM algorithm, in SSL our input variable $Z$ changes in each iteration, and the optimization is with respect to both $Z$ and $Z'$.

5.2. PREVENTING POINT COLLAPSE UNDER THE EM ALGORITHM

For Gaussian mixture models (GMMs), clustering consists of estimating the parameters that maximize the likelihood function, followed by assigning each data point to the cluster corresponding to its most likely multivariate Gaussian distribution. Chen & He (2021) suggested that the SimSiam method can be viewed as the K-means algorithm, which can be derived by reducing the GMMs. Let us examine a toy dataset with the pattern of two intertwining moons to illustrate the collapse phenomenon under a GMM (Figure 1, right). We begin by training a classical GMM with maximum likelihood, where the means are initialized from random samples and the covariance is set to the identity matrix. A red dot represents a Gaussian's mean after training, while a blue dot represents a data point. With fixed input samples, we observe that there is no collapse and that the entropy of the centers is high (Figure 4, left, in the Appendix). However, when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy (Figure 4, right, in the Appendix). To prevent collapse, we follow the K-means algorithm in enforcing sparse posteriors, i.e., using small initial standard deviations and learning only the mean. This forces a one-to-one mapping, which leads all points to be closest to their mean without collapsing, resulting in high entropy (Figure 4, middle, in the Appendix). Another option to prevent collapse is to use different learning rates for the inputs and the parameters. In this setting, the collapse of the parameters does not maximize the likelihood. Figure 1 (right) shows the results of a GMM with different learning rates for the learned inputs and the parameters. When the parameter learning rate is sufficiently high in comparison to the input learning rate, the entropy decreases much more slowly and no collapse occurs.
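The collapse mechanism can be reproduced in a setting much simpler than the two-moons GMM experiment. The sketch below is a deliberately reduced illustration, not the experiment from the paper: it uses a single one-dimensional unit-variance Gaussian, gradient ascent on the average log-likelihood with respect to both the mean and the data points, and the variance of the points as a proxy for entropy. All names, learning rates, and data values are hypothetical.

```python
def variance(points):
    m = sum(points) / len(points)
    return sum((p - m) ** 2 for p in points) / len(points)


def joint_ascent(points, mean, lr_points, lr_mean, steps=200):
    """Gradient ascent on the average of log N(x_i; mean, 1) with respect to
    the mean (parameter) and, if lr_points > 0, the points themselves."""
    pts = list(points)
    for _ in range(steps):
        # d/d(mean) of the average log-likelihood is the average of (x_i - mean).
        grad_mean = sum(p - mean for p in pts) / len(pts)
        # d/d(x_i) of log N(x_i; mean, 1) is (mean - x_i): points move to the mean.
        pts = [p + lr_points * (mean - p) for p in pts]
        mean += lr_mean * grad_mean
    return pts, mean


data = [-2.0, -1.0, 0.0, 1.0, 2.0]
fixed, _ = joint_ascent(data, mean=0.5, lr_points=0.0, lr_mean=0.1)
slow, _ = joint_ascent(data, mean=0.5, lr_points=0.001, lr_mean=0.1)
fast, _ = joint_ascent(data, mean=0.5, lr_points=0.1, lr_mean=0.1)

var_fixed, var_slow, var_fast = variance(fixed), variance(slow), variance(fast)
```

With fixed inputs the spread of the data is preserved, with a comparable learning rate on inputs and parameters the points contract to a single location, and with a much smaller input learning rate the spread decays far more slowly over the same number of steps.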

6. BENEFITS OF INFORMATION MAXIMIZATION FOR GENERALIZATION

The purpose of this section is to further connect the invariance loss, the covariance matrix, and the information with the input to the generalization ability of the model by deriving a novel generalization bound. Together with the results from the previous sections, this provides a mathematical understanding of the benefits of SSL through maximization of information with implicit regularization.

6.1. NOTATION

Let $x$ be our input and $y \in \mathbb{R}^r$ the output. We are given labeled training data $S = ((x_i, y_i))_{i=1}^{n}$ of size $n$ and unlabeled training data $\bar{S} = ((x_i^{+}, x_i^{++}))_{i=1}^{m}$ of size $m$, where $x_i^{+}$ and $x_i^{++}$ share the same (unknown) label. With the unlabeled training data, we define the invariance loss
$$I_{\bar{S}}(f_\theta) = \frac{1}{m}\sum_{i=1}^{m}\|f_\theta(x_i^{+}) - f_\theta(x_i^{++})\|,$$
where $f_\theta$ is the representation trained on the unlabeled data $\bar{S}$. We define a labeled loss $\ell_{x,y}(w) = \|W f_\theta(x) - y\|$, where $w = \mathrm{vec}[W] \in \mathbb{R}^{dr}$ is the vectorization of the matrix $W \in \mathbb{R}^{r \times d}$. Let $w_S = \mathrm{vec}[W_S]$ be the minimum-norm solution,
$$W_S = \operatorname*{minimize}_{W'} \|W'\|_F \quad \text{s.t.} \quad W' \in \arg\min_{W} \frac{1}{n}\sum_{i=1}^{n}\|W f_\theta(x_i) - y_i\|^2.$$
We also define the representation matrices $Z_S = [f(x_1), \dots, f(x_n)] \in \mathbb{R}^{d \times n}$ and $Z_{\bar{S}} = [f(x_1^{+}), \dots, f(x_m^{+})] \in \mathbb{R}^{d \times m}$, and the projection matrices $P_{Z_S} = I - Z_S^\top (Z_S Z_S^\top)^\dagger Z_S$ and $P_{Z_{\bar{S}}} = I - Z_{\bar{S}}^\top (Z_{\bar{S}} Z_{\bar{S}}^\top)^\dagger Z_{\bar{S}}$. We define the label matrix $Y_S = [y_1, \dots, y_n]^\top \in \mathbb{R}^{n \times r}$ and the unknown label matrix $Y_{\bar{S}} = [y_1^{+}, \dots, y_m^{+}]^\top \in \mathbb{R}^{m \times r}$, where $y_i^{+}$ is the unknown label of $x_i^{+}$. Let $\mathcal{F}$ be a hypothesis space of $f_\theta$. For a given hypothesis space $\mathcal{F}$, we define the normalized Rademacher complexity
$$\tilde{\mathcal{R}}_m(\mathcal{F}) = \frac{1}{\sqrt{m}}\,\mathbb{E}_{\bar{S},\xi}\Big[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i \|f(x_i^{+}) - f(x_i^{++})\|\Big],$$
where $\xi_1, \dots, \xi_m$ are independent uniform random variables taking values in $\{-1, 1\}$. It is normalized such that $\tilde{\mathcal{R}}_m(\mathcal{F}) = O(1)$ as $m \to \infty$ for typical choices of hypothesis spaces $\mathcal{F}$, including DNs (Bartlett et al., 2017; Kawaguchi et al., 2018).

6.2. GENERALIZATION BOUND FOR VICREG

We now show that SSL via VICReg can be understood to improve the generalization ability for the supervised downstream task. Namely, Theorem 2 shows that the expected labeled loss $\mathbb{E}_{x,y}[\ell_{x,y}(w_S)]$ is minimized when we minimize the unlabeled invariance loss $I_{\bar{S}}(f_\theta)$ while controlling the covariance $Z_{\bar{S}} Z_{\bar{S}}^\top$ and the complexity of representations $\tilde{\mathcal{R}}_m(\mathcal{F})$:
Theorem 2. (Informal version). For any $\delta > 0$, with probability at least $1 - \delta$, the following holds:
$$\mathbb{E}_{x,y}[\ell_{x,y}(w_S)] \leq I_{\bar{S}}(f_\theta) + \frac{2}{\sqrt{m}}\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F + \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F + \frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} + Q_{m,n},$$
where $Q_{m,n} = O\big(G\sqrt{\ln(1/\delta)/m} + \sqrt{\ln(1/\delta)/n}\big) \to 0$ as $m, n \to \infty$. In $Q_{m,n}$, the value of $G$ for the term decaying at the rate $1/\sqrt{m}$ depends on the hypothesis space of $f_\theta$ and $w$, whereas the term decaying at the rate $1/\sqrt{n}$ is independent of any hypothesis space.
Proof. The complete version of Theorem 2 and its proof are presented in Appendix G.
Note that our framework holds for classification with a linear layer and the $\ell_2$ norm as the loss. Also, for this bound not to become vacuous, we should impose that the class $\mathcal{F}$ has a finite norm range and that the class of matrices for the linear layer $W$ is of finite norm. The term $\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F$ in Theorem 2 contains the unobservable label matrix $Y_{\bar{S}}$. However, we can minimize this term by using $\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F \leq \|P_{Z_{\bar{S}}}\|_F \|Y_{\bar{S}}\|_F$ and by minimizing $\|P_{Z_{\bar{S}}}\|_F$. The factor $\|P_{Z_{\bar{S}}}\|_F$ is minimized when the rank of the covariance $Z_{\bar{S}} Z_{\bar{S}}^\top$ is maximized. Since a strictly diagonally dominant matrix is non-singular, this can be enforced by maximizing the diagonal entries while minimizing the off-diagonal entries, as is done in VICReg. For example, if $d \geq n$, then $\|P_{Z_{\bar{S}}}\|_F = 0$ when the covariance $Z_{\bar{S}} Z_{\bar{S}}^\top$ is of full rank. The term $\|P_{Z_S} Y_S\|_F$ contains only observable variables, and we can directly measure the value of this term using training data. In addition, the term $\|P_{Z_S} Y_S\|_F$ is also minimized when the rank of the covariance $Z_S Z_S^\top$ is maximized.
Since the covariances $Z_S Z_S^\top$ and $Z_{\bar{S}} Z_{\bar{S}}^\top$ concentrate to each other via concentration inequalities, with an error of order $O\big(\sqrt{\ln(1/\delta)/n} + \tilde{\mathcal{R}}_m(\mathcal{F})\sqrt{\ln(1/\delta)/m}\big)$, we can also minimize the upper bound on $\|P_{Z_S} Y_S\|_F$ by maximizing the diagonal entries of $Z_{\bar{S}} Z_{\bar{S}}^\top$ while minimizing its off-diagonal entries, as is done in VICReg. Thus, VICReg can be understood as a method to minimize the generalization bound in Theorem 2 by minimizing the invariance loss while controlling the covariance $Z_{\bar{S}} Z_{\bar{S}}^\top$ to minimize the label-agnostic upper bounds on $\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F$ and $\|P_{Z_S} Y_S\|_F$. If we know partial information about the label $Y_{\bar{S}}$ of the unlabeled data, we can use it to minimize $\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F$ and $\|P_{Z_S} Y_S\|_F$ directly. This direction can be used to improve VICReg in future work for the partially observable setting.
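The link between the rank of the covariance and the norm of the projection matrix discussed in this section can be made explicit with a short worked derivation. It uses only standard properties of the Moore-Penrose pseudo-inverse; here $Z$ stands for either representation matrix with $n$ columns (the symbol $n$ is generic in this sketch), and $I_n$ denotes the $n \times n$ identity.

```latex
\begin{aligned}
Z^{\dagger} &= Z^{\top}(ZZ^{\top})^{\dagger}
  &&\Longrightarrow\quad P_Z = I_n - Z^{\top}(ZZ^{\top})^{\dagger}Z = I_n - Z^{\dagger}Z, \\
P_Z^{\top} &= P_Z, \qquad P_Z^{2} = P_Z
  &&\Longrightarrow\quad \|P_Z\|_F^{2} = \operatorname{tr}(P_Z^{\top}P_Z) = \operatorname{tr}(P_Z), \\
\operatorname{tr}(P_Z) &= n - \operatorname{tr}(Z^{\dagger}Z) = n - \operatorname{rank}(Z)
  &&\Longrightarrow\quad \|P_Z\|_F = \sqrt{\,n - \operatorname{rank}(ZZ^{\top})\,}.
\end{aligned}
```

Since $\operatorname{rank}(Z) = \operatorname{rank}(ZZ^\top)$, every additional unit of rank in the covariance lowers $\|P_Z\|_F^2$ by one, and the norm vanishes exactly when the rank reaches $n$.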

6.3. UNDERSTANDING VIA MUTUAL INFORMATION

Theorem 2, together with the results of the previous sections, shows that, for generalization in the downstream task, it is helpful to maximize the mutual information $I(Z; X')$ in SSL by minimizing the invariance loss $I_{\bar{S}}(f_\theta)$ while controlling the covariance $Z_{\bar{S}} Z_{\bar{S}}^\top$. The term $2\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m}$ captures the importance of controlling the complexity of the representations $f_\theta$. To understand this term further in terms of mutual information, let us consider a discretization of the parameter space of $\mathcal{F}$ so that $|\mathcal{F}| < \infty$ (indeed, a computer always implements some discretization of continuous variables). Then, by Massart's Finite Class Lemma, we have that $\tilde{\mathcal{R}}_m(\mathcal{F}) \leq C\sqrt{\ln |\mathcal{F}|}$ for some constant $C > 0$. Moreover, Shwartz-Ziv (2022) shows that we can approximate $\ln |\mathcal{F}|$ by $2^{I(Z;X)}$. Thus, in Theorem 2, the term $I_{\bar{S}}(f_\theta) + \frac{2}{\sqrt{m}}\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F + \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F$ corresponds to $I(Z; X')$, while the term $2\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m}$ corresponds to $I(Z; X)$. Recall that the information can be decomposed as
$$I(Z; X) = I(Z; X') + I(Z; X|X'),$$
where we want to maximize the predictive information $I(Z; X')$ while minimizing $I(Z; X)$. Thus, in order to improve generalization, we also need to control $2\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m}$ to restrict the superfluous information $I(Z; X|X')$, in addition to minimizing $I_{\bar{S}}(f_\theta) + \frac{2}{\sqrt{m}}\|P_{Z_{\bar{S}}} Y_{\bar{S}}\|_F + \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F$, which corresponds to maximizing the predictive information $I(Z; X')$. Although we can explicitly add regularization on $I(Z; X|X')$ to control $2\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m}$, it is possible that $I(Z; X|X')$ and $2\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m}$ are implicitly regularized via implicit bias through design choices (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018). Thus, Theorem 2 connects the information-theoretic understanding of SSL with the probabilistic guarantee on the generalization ability.
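The decomposition of $I(Z; X)$ used in this section follows from the chain rule of mutual information. The short derivation below spells it out under one assumption that is implicit here: $Z$ depends on $X'$ only through $X$ (for instance, $Z = f(X)$ for a deterministic network), so that $I(Z; X'|X) = 0$.

```latex
\begin{aligned}
I(Z; X, X') &= I(Z; X') + I(Z; X \mid X') \\
            &= I(Z; X) + I(Z; X' \mid X), \\
I(Z; X' \mid X) = 0 \;&\Longrightarrow\; I(Z; X) = I(Z; X') + I(Z; X \mid X').
\end{aligned}
```

Read this way, $I(Z; X|X')$ is the part of the information about the input that is not shared with the other view, which is why it is referred to as superfluous.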

6.4. COMPARING GENERALIZATION BOUNDS

The generalization bound of SimCLR (Saunshi et al., 2019) requires the number of label classes to go to infinity for the generalization gap to decrease towards zero. In contrast, the bound on VICReg in Theorem 2 does not require the number of label classes to approach infinity for the generalization gap to go to zero. This reflects the fact that, unlike SimCLR, VICReg does not use negative pairs and thus does not use a loss function based on the implicit expectation that the labels of a negative pair $(y^+, y^-)$ are different. Another difference is that our VICReg bound improves as $n$ increases, while the previous bound of SimCLR (Saunshi et al., 2019) does not depend on $n$. This is because the previous work assumes partial access to the true distribution of $x$ given $y$ per class for setting $W$, which removes the importance of the labeled data size $n$ and is not assumed in our study. Consequently, our bound provides a new insight for VICReg regarding the ratio of the effects of $m$ vs. $n$ through $G\sqrt{\ln(1/\delta)/m} + \sqrt{\ln(1/\delta)/n}$. Finally, Theorem 2 also illuminates the advantages of VICReg over standard supervised training. That is, with standard training, the generalization bound via the Rademacher complexity requires the complexities of the hypothesis spaces, $\tilde{\mathcal{R}}_n(\mathcal{W})/\sqrt{n}$ and $\tilde{\mathcal{R}}_n(\mathcal{F})/\sqrt{n}$, with respect to the size of the labeled data $n$, instead of the size of the unlabeled data $m$. Here, $\tilde{\mathcal{R}}_n(\mathcal{W})$ is the normalized Rademacher complexity for the hypothesis space of $w$. Thus, Theorem 2 shows that, using SSL, we can replace all the complexities of hypothesis spaces in terms of $n$ with those in terms of $m$. Since $m$ is typically much larger than $n$, this illuminates the benefit of SSL.

7. CONCLUSIONS

In this study, we examined SSL's objective function from an information-theoretic perspective. By transferring the required stochasticity to the input distribution, we showed how SSL objectives can be derived. Thus, even when using deterministic DNs, it is possible to perform an information-theoretic analysis. In the second part of the paper, we rediscovered SSL loss functions from first principles and exposed their implicit assumptions. We empirically validated our analysis and confirmed the validity of our novel understanding. As a result of our analysis, we proposed new SSL algorithms that perform better than existing ones. Furthermore, we derived a generalization bound on the downstream task, tied it to known information-theoretic objective terms, and demonstrated that VICReg minimizes it. Our work opens many new avenues for future research, including better estimation of information-theoretic quantities that are consistent with our assumptions and identifying which SSL method is the most appropriate according to data characteristics. In addition, our probabilistic guarantee suggests that VICReg can be further improved for the setting of partial label information by aligning the covariance matrix with the partially observable label matrix.

A DATA DISTRIBUTION AFTER DEEP NETWORK TRANSFORMATION

Theorem 3. Given the setting of eq. (4), the unconditional DN output density, denoted as $Z$, approximates (given the truncation of the Gaussian on its effective support that is included within a single region $\omega$ of the DN's input space partition) a mixture of the affinely transformed distributions $x|x^*_{n(x)}$, e.g., for the Gaussian case
\[ Z \sim \sum_{n=1}^{N} \mathcal{N}\left(A_{\omega(x^*_n)} x^*_n + b_{\omega(x^*_n)},\; A^T_{\omega(x^*_n)} \Sigma_{x^*_n} A_{\omega(x^*_n)}\right)^{T=n}, \]
where $\omega(x^*_n) = \omega \in \Omega \iff x^*_n \in \omega$ is the partition region in which the prototype $x^*_n$ lives.

Proof. We know that if $\int_{\omega} p(x|x^*_{n(x)})\,dx \approx 1$, then $f$ is linear within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega \in \Omega$, and the entire mapping can be considered linear with respect to $p$. Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.

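B LOWER BOUNDS ON $\mathbb{E}_{x'}[\log q(z|x')]$

In this appendix we present the full derivation of the lower bound on $\mathbb{E}_{x'}[\log q(z|x')]$. Because $Z'|X'$ is Gaussian, we can write it as $Z' = \mu(x') + L(x')\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$ and $L(x')^T L(x') = \Sigma(x')$. Setting $\Sigma_r = I$ gives:
\begin{align*}
&\mathbb{E}_{x'}\left[\log q(z|x')\right] \geq \tag{15}\\
&\mathbb{E}_{z'|x'}\left[\log q(z|z')\right] = \tag{16}\\
&\mathbb{E}_{z'|x'}\left[\frac{d}{2}\log 2\pi - \frac{1}{2}(z - z')^T I^{-1} (z - z')\right] = \tag{17}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{z'|x'}\left[(z - z')^2\right] = \tag{18}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{\epsilon}\left[(z - \mu(x') - L(x')\epsilon)^2\right] = \tag{19}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{\epsilon}\left[(z - \mu(x'))^2 - 2(z - \mu(x'))^T L(x')\epsilon + (L(x')\epsilon)^T (L(x')\epsilon)\right] = \tag{20}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}(z - \mu(x'))^2 + (z - \mu(x'))^T L(x')\,\mathbb{E}_{\epsilon}[\epsilon] - \frac{1}{2}\mathbb{E}_{\epsilon}\left[\epsilon^T L(x')^T L(x')\epsilon\right] = \tag{21}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}(z - \mu(x'))^2 - \frac{1}{2}\operatorname{Tr}\log \Sigma(x'),
\end{align*}
where $\mathbb{E}_{x'}[\log q(z|x')] = \mathbb{E}_{x'}\left[\log \mathbb{E}_{z'|x'}[q(z|z')]\right] \geq \mathbb{E}_{z'}[\log q(z|z')]$ by Jensen's inequality, $\mathbb{E}_{\epsilon}[\epsilon] = 0$, and $\mathbb{E}_{\epsilon}\left[\epsilon^T L(x')^T L(x')\epsilon\right] = \operatorname{Tr}\log \Sigma(x')$ by Hutchinson's estimator. Taking the expectation over $z|x$ then gives:
\begin{align*}
&\mathbb{E}_{z|x}\left[\mathbb{E}_{z'|x'}\left[\log q(z|z')\right]\right] = \tag{23}\\
&\mathbb{E}_{z|x}\left[\frac{d}{2}\log 2\pi - \frac{1}{2}(z - \mu(x'))^2 - \frac{1}{2}\operatorname{Tr}\log \Sigma(x')\right] = \tag{24}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{z|x}\left[(z - \mu(x'))^2\right] - \frac{1}{2}\operatorname{Tr}\log \Sigma(x') = \tag{25}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{\epsilon}\left[(\mu(x) + L(x)\epsilon - \mu(x'))^2\right] - \frac{1}{2}\operatorname{Tr}\log \Sigma(x') = \tag{26}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}\mathbb{E}_{\epsilon}\left[(\mu(x) - \mu(x'))^2\right] + \mathbb{E}_{\epsilon}\left[(\mu(x) - \mu(x'))^T L(x)\epsilon\right] \tag{27}\\
&\quad - \frac{1}{2}\mathbb{E}_{\epsilon}\left[\epsilon^T L(x)^T L(x)\epsilon\right] - \frac{1}{2}\operatorname{Tr}\log \Sigma(x') = \tag{28}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}(\mu(x) - \mu(x'))^2 - \frac{1}{2}\operatorname{Tr}\log \Sigma(x) - \frac{1}{2}\operatorname{Tr}\log \Sigma(x') = \tag{29}\\
&\frac{d}{2}\log 2\pi - \frac{1}{2}(\mu(x) - \mu(x'))^2 - \frac{1}{2}\log\left(|\Sigma(x)| \cdot |\Sigma(x')|\right).
\end{align*}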

C ENTROPY ESTIMATORS

The estimation of entropy is one of the classic problems in information theory, where the Gaussian mixture density is one of the most popular representations. With a sufficient number of components, Gaussian mixtures can approximate any smooth function with arbitrary accuracy. For Gaussian mixtures, there is, however, no closed-form solution for the differential entropy. There exist several approximations in the literature, including loose upper and lower bounds (Huber et al., 2008). Monte Carlo (MC) sampling is one way to approximate the Gaussian mixture entropy. With sufficient MC samples, an unbiased estimate of the entropy with arbitrary accuracy can be obtained. Unfortunately, MC sampling is very computationally expensive and typically requires a large number of samples, especially in high dimensions (Brewer, 2017). Using the first two moments of the empirical distribution, VICReg used one of the most straightforward approaches for approximating the entropy. Despite this, previous studies have found that this method is a poor approximation of the entropy in many cases (Huber et al., 2008). Another option is to use the LogDet function. Several estimators have been proposed to implement it, including uniformly minimum variance unbiased (UMVU) estimators (Ahmed & Gokhale, 1989) and Bayesian methods (Misra et al., 2005). These methods, however, often require complex optimizations. The LogDet estimator presented in Zhouyin & Liu (2021) estimates the differential entropy through the α-order entropy using scaled noise. They demonstrated that it can be applied to high-dimensional features and is robust to random noise. Based on Taylor-series expansions, Huber et al. (2008) presented a lower bound for the entropy of Gaussian mixture random vectors. They use Taylor-series expansions of the logarithm of each Gaussian mixture component to get an analytical evaluation of the entropy measure. In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would require computationally expensive calculations. Kolchinsky & Tracey (2017) introduce a novel family of estimators for the mixture entropy. For this family, a pairwise-distance function between component densities is defined for each member. These estimators are computationally efficient as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. Moreover, the estimator is continuous and smooth and is therefore useful for optimization problems. In addition, they presented both a lower bound (using the Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are exact when the component distributions are grouped into well-separated clusters.
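To make the pairwise-distance family concrete, the following is a minimal sketch of the two bounds for the special case of a mixture of isotropic Gaussians that share one covariance $\sigma^2 I$. This restriction, the function names, and the example means are assumptions made for illustration; the lower bound uses the Bhattacharyya distance (the Chernoff distance with $\alpha = 1/2$) and the upper bound uses the KL divergence, both of which have closed forms for equal-covariance Gaussians.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian: 0.5 * logdet(2*pi*e*cov)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def pairwise_entropy_bounds(means, sigma2, weights=None):
    """Lower and upper bounds on the entropy of sum_i w_i N(mu_i, sigma2 * I)."""
    n, d = means.shape
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    # Squared Euclidean distances between all pairs of component means.
    sq = np.sum((means[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    h_comp = gaussian_entropy(sigma2 * np.eye(d))   # identical for every component

    def estimate(dist):
        # H_hat = sum_i w_i H(p_i) - sum_i w_i log sum_j w_j exp(-D(p_i || p_j))
        return h_comp - np.sum(w * np.log(np.exp(-dist) @ w))

    lower = estimate(sq / (8.0 * sigma2))   # Bhattacharyya distance
    upper = estimate(sq / (2.0 * sigma2))   # KL divergence
    return lower, upper

# Two well-separated, equally weighted components (illustrative values).
means = np.array([[0.0, 0.0], [100.0, 0.0]])
lower, upper = pairwise_entropy_bounds(means, sigma2=1.0)
print(lower, upper)
```

For the two well-separated components above, both bounds coincide with the component entropy plus $\ln 2$, illustrating the exactness for well-separated clusters mentioned in the text.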

D EMPIRICAL VALIDATION OF OUR ASSUMPTION

We verify our assumptions empirically on different datasets. We compute the pairwise l2 distances between images for the following datasets: MNIST, CIFAR10, CIFAR100, Flowers102, Food101, and FGVCAircraft. We found that, even for raw pixels, the pairwise distances are far from zero, which means that a small Gaussian can be placed around each point without overlapping its neighbors. Consequently, the effective supports of these high-dimensional datasets do not overlap, and our assumption is realistic even for current popular SSL datasets.
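The distance check described above can be sketched as follows. This is an illustrative implementation, not the script used for the paper: the helper name `min_pairwise_l2` is hypothetical, and the random array merely stands in for a batch of images loaded from one of the datasets.

```python
import numpy as np

def min_pairwise_l2(images):
    """Smallest l2 distance between two distinct samples after flattening."""
    x = images.reshape(len(images), -1).astype(np.float64)
    sq_norms = np.sum(x ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(sq_dists, np.inf)   # ignore the distance of a sample to itself
    return float(np.sqrt(max(sq_dists.min(), 0.0)))

# Stand-in for a batch of 32x32 RGB images with pixel values in [0, 1].
rng = np.random.default_rng(0)
fake_images = rng.uniform(size=(64, 32, 32, 3))
print(min_pairwise_l2(fake_images))
```

A Gaussian whose standard deviation is well below the returned minimum distance has an effective support that does not reach neighboring samples, which is the property the assumption relies on.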

E EXPERIMENTAL VERIFICATION OF INFORMATION-BASED BOUND OPTIMIZATION

Setup. Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009). We use ResNet-18 (He et al., 2016) as our backbone. Each model is trained with a batch size of 512 for 800 epochs. We use linear evaluation to assess the quality of the representation. Once the model has been pre-trained, we follow the same fine-tuning procedures as for the baseline methods (Caron et al., 2020).

F EXPECTATION MAXIMIZATION AND COLLAPSING

G ON BENEFITS OF INFORMATION MAXIMIZATION FOR GENERALIZATION

In this Appendix, we present the complete version of Theorem 2 along with its proof and additional discussions. 

[Figure: pairwise distances per dataset; axis labels: MNIST, CIFAR10, CIFAR100, Flowers102, Food101, FGVCAircraft]

Figure 4: Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids akin to K-means, i.e., using a small and fixed covariance matrix (panel titles: H(Z)=45, H(Z)=40, H(Z)=34; legend: data, centroids). We see that collapse does not occur. Left: in the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high. Right: when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy.

Let $W_{\bar S}$ be the minimum norm solution on the unlabeled data, $W_{\bar S} = \operatorname{minimize}_{W'} \|W'\|_F$ s.t. $W' \in \arg\min_{W} \frac{1}{m}\sum_{i=1}^{m} \|W f_\theta(x^+_i) - g^*(x^+_i)\|^2$. Let $\kappa_S$ be a data-dependent upper bound on the per-sample Euclidean norm loss with the trained model, $\|W_S f_\theta(x) - y\| \le \kappa_S$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Similarly, let $\kappa_{\bar S}$ be a data-dependent upper bound on the per-sample Euclidean norm loss, $\|W_{\bar S} f_\theta(x) - y\| \le \kappa_{\bar S}$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Define the difference between $W_S$ and $W_{\bar S}$ by $c = \|W_S - W_{\bar S}\|_2$. Let $\mathcal{W}$ be a hypothesis space of $W$ such that $W_{\bar S} \in \mathcal{W}$. We denote by $\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F}) = \frac{1}{\sqrt{m}} \mathbb{E}_{\bar S, \xi}\left[\sup_{W \in \mathcal{W}, f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i \|g^*(x^+_i) - W f(x^+_i)\|\right]$ the normalized Rademacher complexity of the set $\{x^+ \mapsto \|g^*(x^+) - W f(x^+)\| : W \in \mathcal{W}, f \in \mathcal{F}\}$. We denote by $\kappa$ an upper bound on the per-sample Euclidean norm loss, $\|W f(x) - y\| \le \kappa$ for all $(x, y, W, f) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{W} \times \mathcal{F}$. We adopt the following data-generating process model that is used in the previous paper on analyzing contrastive learning (Saunshi et al., 2019). For the labeled data, first, $y$ is drawn from the distribution $\rho$ on $\mathcal{Y}$, and then $x$ is drawn from the conditional distribution $\mathcal{D}_y$ conditioned on the label $y$. That is, we have the joint distribution $\mathcal{D}(x, y) = \mathcal{D}_y(x)\rho(y)$ with $((x_i, y_i))_{i=1}^{n} \sim \mathcal{D}^n$.
For the unlabeled data, first, each of the unknown labels $y^+$ and $y^-$ is drawn from the distribution $\rho$, and then each of the positive examples $x^+$ and $x^{++}$ is drawn from the conditional distribution $\mathcal{D}_{y^+}$, while the negative example $x^-$ is drawn from $\mathcal{D}_{y^-}$. Unlike the analysis of contrastive learning, we do not require the negative samples. Let $\tau_{\bar S}$ be a data-dependent upper bound on the invariance loss with the trained representation, $\|f_\theta(\bar x) - f_\theta(x)\| \le \tau_{\bar S}$ for all $(\bar x, x) \sim \mathcal{D}_y^2$ and $y \in \mathcal{Y}$. Let $\tau$ be a data-independent upper bound on the invariance loss, $\|f(\bar x) - f(x)\| \le \tau$ for all $(\bar x, x) \sim \mathcal{D}_y^2$, $y \in \mathcal{Y}$, and $f \in \mathcal{F}$. For simplicity, we assume that there exists a function $g^*$ such that $y = g^*(x) \in \mathbb{R}^r$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Discarding this assumption adds the average of the label noise to the final result, which goes to zero as the sample sizes $n$ and $m$ increase, assuming that the mean of the label noise is zero.

G.2 RESULT

The following theorem is the complete version of Theorem 2:

Theorem 4. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds:
\[ \mathbb{E}_{x,y}[\ell_{x,y}(w_S)] \le c\,I_{\bar S}(f_\theta) + \frac{2}{\sqrt{m}}\|P_{Z_{\bar S}} Y_{\bar S}\|_F + \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F + Q_{m,n}, \]
where
\begin{align*}
Q_{m,n} ={}& c\left(\frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} + \tau\sqrt{\frac{\ln(3/\delta)}{2m}} + \tau_{\bar S}\sqrt{\frac{\ln(3/\delta)}{2n}}\right) + \kappa_S\sqrt{\frac{2\ln(6|\mathcal{Y}|/\delta)}{2n}}\sum_{y \in \mathcal{Y}}\left(\sqrt{\hat p(y)} + \sqrt{p(y)}\right)\\
&+ \frac{4\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + 2\kappa\sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bar S}\sqrt{\frac{\ln(4/\delta)}{2n}}.
\end{align*}

Proof. The complete proof is presented in Appendix G.3.

The bound in the complete version (Theorem 4) is better than the one in the informal version (Theorem 2) because of the factor $c$. The factor $c$ measures the difference between the minimum norm solution $W_S$ of the labeled training data and the minimum norm solution $W_{\bar S}$ of the unlabeled training data. Thus, the factor $c$ also decreases towards zero as $n$ and $m$ increase. Moreover, if the labeled and unlabeled training data are similar, the value of $c$ is small, decreasing the generalization bound further, which makes sense. Thus, we can view the factor $c$ as a measure of the distance between the labeled training data and the unlabeled training data. We obtain the informal version from the complete version of Theorem 2 by the following reasoning, which simplifies the notation in the main text. We have that $c\,I_{\bar S}(f_\theta) + c\,\frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} = I_{\bar S}(f_\theta) + \frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} + Q$, where $Q = (c - 1)\left(I_{\bar S}(f_\theta) + \frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}}\right) \le \varsigma \to 0$ as $m, n \to \infty$, since $c \to 0$ as $m, n \to \infty$. However, this reasoning is used only to simplify the notation in the main text. The bound in the complete version of Theorem 2 is more accurate and indeed tighter than the one in the informal version. In Theorem 2, $Q_{m,n} \to 0$ as $m, n \to \infty$ if $\tilde{\mathcal{R}}_m(\mathcal{F})/\sqrt{m} \to 0$ as $m \to \infty$. Indeed, this typically holds because $\tilde{\mathcal{R}}_m(\mathcal{F}) = O(1)$ as $m \to \infty$ for typical choices of $\mathcal{F}$, including deep neural networks (Bartlett et al., 2017; Kawaguchi et al., 2018; Golowich et al., 2018) as well as other common machine learning models (Bartlett & Mendelson, 2002; Mohri et al., 2012; Shalev-Shwartz & Ben-David, 2014).

G.3 PROOF OF THEOREM 2

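Proof of Theorem 2. Let $W = W_S$, where $W_S$ is the minimum norm solution $W_S = \operatorname{minimize}_{W'} \|W'\|_F$ s.t. $W' \in \arg\min_{W} \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - y_i\|^2$. Let $W^* = W_{\bar S}$, where $W_{\bar S}$ is the minimum norm solution $W^* = W_{\bar S} = \operatorname{minimize}_{W'} \|W'\|_F$ s.t. $W' \in \arg\min_{W} \frac{1}{m}\sum_{i=1}^{m} \|W f_\theta(x^+_i) - g^*(x^+_i)\|^2$. Since $y = g^*(x)$,
\[ y = g^*(x) \pm W^* f_\theta(x) = W^* f_\theta(x) + (g^*(x) - W^* f_\theta(x)) = W^* f_\theta(x) + \varphi(x), \]
where $\varphi(x) = g^*(x) - W^* f_\theta(x)$. Define $L_S(w) = \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - y_i\|$. Using these,
\begin{align*}
L_S(w) &= \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - y_i\|\\
&= \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - W^* f_\theta(x_i) - \varphi(x_i)\|\\
&\ge \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - W^* f_\theta(x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|\\
&= \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|,
\end{align*}
where $\tilde W = W - W^*$. We now consider new fresh samples $\bar x_i \sim \mathcal{D}_{y_i}$ for $i = 1, \dots, n$ to rewrite the above further as:
\begin{align*}
L_S(w) &\ge \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(x_i) \pm \tilde W f_\theta(\bar x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|\\
&= \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i) - (\tilde W f_\theta(\bar x_i) - \tilde W f_\theta(x_i))\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|\\
&\ge \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i) - \tilde W f_\theta(x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|\\
&= \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i)\| - \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|.
\end{align*}
This implies that
\[ \frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i)\| \le L_S(w) + \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\|. \]
Furthermore, since $y = W^* f_\theta(x) + \varphi(x)$, by writing $\bar y_i = W^* f_\theta(\bar x_i) + \varphi(\bar x_i)$ (where $\bar y_i = y_i$ since $\bar x_i \sim \mathcal{D}_{y_i}$ for $i = 1, \dots, n$),
\begin{align*}
\frac{1}{n}\sum_{i=1}^{n} \|\tilde W f_\theta(\bar x_i)\| &= \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(\bar x_i) - W^* f_\theta(\bar x_i)\|\\
&= \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(\bar x_i) - \bar y_i + \varphi(\bar x_i)\|\\
&\ge \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(\bar x_i) - \bar y_i\| - \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\|.
\end{align*}
Combining these, we have that
\[ \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(\bar x_i) - \bar y_i\| \le L_S(w) + \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\|. \tag{32} \]
To bound the left-hand side of equation 32, we now analyze the following random variable:
\[ \mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\|, \tag{33} \]
where $\bar y_i = y_i$ since $\bar x_i \sim \mathcal{D}_{y_i}$ for $i = 1, \dots, n$.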
Importantly, this means that as $W_S$ depends on $y_i$, $W_S$ depends on $\bar y_i$. Thus, the collection of random variables $\|W_S f_\theta(\bar x_1) - \bar y_1\|, \dots, \|W_S f_\theta(\bar x_n) - \bar y_n\|$ is not independent. Accordingly, we cannot apply a standard concentration inequality to bound equation 33. A standard approach in learning theory is to first bound equation 33 by
\[ \mathbb{E}_{x,y}\|W_S f_\theta(x) - y\| - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| \le \sup_{W \in \mathcal{W}} \left(\mathbb{E}_{x,y}\|W f_\theta(x) - y\| - \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(\bar x_i) - \bar y_i\|\right) \]
for some hypothesis space $\mathcal{W}$ (that is independent of $S$) and to realize that the right-hand side now contains the collection of independent random variables $\|W f_\theta(\bar x_1) - \bar y_1\|, \dots, \|W f_\theta(\bar x_n) - \bar y_n\|$, for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space $\mathcal{W}$. However, the complexity of the hypothesis space $\mathcal{W}$ can be very large, resulting in a loose bound. In this proof, we show that we can avoid the dependency on the hypothesis space $\mathcal{W}$ by using a very different approach with conditional expectations to take care of the dependent random variables $\|W_S f_\theta(\bar x_1) - \bar y_1\|, \dots, \|W_S f_\theta(\bar x_n) - \bar y_n\|$. Intuitively, we utilize the fact that these dependent random variables have a structure of conditional independence, conditioned on each $y \in \mathcal{Y}$. We first write the expected loss as the sum of the conditional expected losses:
\[ \mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] = \sum_{y \in \mathcal{Y}} \mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\| \mid Y = y]\,\mathbb{P}(Y = y) = \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\,\mathbb{P}(Y = y), \]
where $X_y$ is the random variable for the conditional with $Y = y$. Using this, we decompose equation 33 into two terms:
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| \tag{34}\\
&= \left(\sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\frac{|I_y|}{n} - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\|\right) + \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\left(\mathbb{P}(Y = y) - \frac{|I_y|}{n}\right),
\end{align*}
where $I_y = \{i \in [n] : y_i = y\}$.
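The first term on the right-hand side of equation 34 is further simplified by using $\frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| = \frac{1}{n}\sum_{y \in \mathcal{Y}}\sum_{i \in I_y} \|W_S f_\theta(\bar x_i) - y\|$, as
\[ \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\frac{|I_y|}{n} - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| = \frac{1}{n}\sum_{y \in \tilde{\mathcal{Y}}} |I_y|\left(\mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|] - \frac{1}{|I_y|}\sum_{i \in I_y} \|W_S f_\theta(\bar x_i) - y\|\right), \]
where $\tilde{\mathcal{Y}} = \{y \in \mathcal{Y} : |I_y| \ne 0\}$. Substituting these into equation 34 yields
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| \tag{35}\\
&= \frac{1}{n}\sum_{y \in \tilde{\mathcal{Y}}} |I_y|\left(\mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|] - \frac{1}{|I_y|}\sum_{i \in I_y} \|W_S f_\theta(\bar x_i) - y\|\right) + \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\left(\mathbb{P}(Y = y) - \frac{|I_y|}{n}\right).
\end{align*}
Importantly, while $\|W_S f_\theta(\bar x_1) - \bar y_1\|, \dots, \|W_S f_\theta(\bar x_n) - \bar y_n\|$ on the right-hand side of equation 35 are dependent random variables, $\|W_S f_\theta(\bar x_1) - y\|, \dots, \|W_S f_\theta(\bar x_n) - y\|$ are independent random variables since $W_S$ and $\bar x_i$ are independent and $y$ is fixed here. Thus, by using Hoeffding's inequality (Lemma 1) and taking union bounds over $y \in \tilde{\mathcal{Y}}$, we have that with probability at least $1 - \delta$, the following holds for all $y \in \tilde{\mathcal{Y}}$:
\[ \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|] - \frac{1}{|I_y|}\sum_{i \in I_y} \|W_S f_\theta(\bar x_i) - y\| \le \kappa_S\sqrt{\frac{\ln(|\tilde{\mathcal{Y}}|/\delta)}{2|I_y|}}. \]
This implies that with probability at least $1 - \delta$,
\[ \frac{1}{n}\sum_{y \in \tilde{\mathcal{Y}}} |I_y|\left(\mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|] - \frac{1}{|I_y|}\sum_{i \in I_y} \|W_S f_\theta(\bar x_i) - y\|\right) \le \frac{\kappa_S}{n}\sum_{y \in \tilde{\mathcal{Y}}} |I_y|\sqrt{\frac{\ln(|\tilde{\mathcal{Y}}|/\delta)}{2|I_y|}} = \kappa_S\left(\sum_{y \in \tilde{\mathcal{Y}}} \sqrt{\frac{|I_y|}{n}}\right)\sqrt{\frac{\ln(|\tilde{\mathcal{Y}}|/\delta)}{2n}}. \]
Substituting this bound into equation 35, we have that with probability at least $1 - \delta$,
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| \tag{36}\\
&\le \kappa_S\left(\sum_{y \in \tilde{\mathcal{Y}}} \sqrt{\hat p(y)}\right)\sqrt{\frac{\ln(|\tilde{\mathcal{Y}}|/\delta)}{2n}} + \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\left(\mathbb{P}(Y = y) - \frac{|I_y|}{n}\right),
\end{align*}
where $\hat p(y) = \frac{|I_y|}{n}$. Moreover, for the second term on the right-hand side of equation 36, by using Lemma 1 of Kawaguchi et al. (2022), we have that with probability at least $1 - \delta$,
\[ \sum_{y \in \mathcal{Y}} \mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\left(\mathbb{P}(Y = y) - \frac{|I_y|}{n}\right) \le \left(\sum_{y \in \mathcal{Y}} \sqrt{p(y)}\,\mathbb{E}_{X_y}[\|W_S f_\theta(X_y) - y\|]\right)\sqrt{\frac{2\ln(|\mathcal{Y}|/\delta)}{2n}} \le \kappa_S\left(\sum_{y \in \mathcal{Y}} \sqrt{p(y)}\right)\sqrt{\frac{2\ln(|\mathcal{Y}|/\delta)}{2n}}, \]
where $p(y) = \mathbb{P}(Y = y)$.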
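Substituting this bound into equation 36 with the union bound, we have that with probability at least $1 - \delta$,
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] - \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| \tag{37}\\
&\le \kappa_S\left(\sum_{y \in \tilde{\mathcal{Y}}} \sqrt{\hat p(y)}\right)\sqrt{\frac{\ln(2|\tilde{\mathcal{Y}}|/\delta)}{2n}} + \kappa_S\left(\sum_{y \in \mathcal{Y}} \sqrt{p(y)}\right)\sqrt{\frac{2\ln(2|\mathcal{Y}|/\delta)}{2n}}.
\end{align*}
Combining equation 32 and equation 37 implies that with probability at least $1 - \delta$,
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] \tag{38}\\
&\le \frac{1}{n}\sum_{i=1}^{n} \|W_S f_\theta(\bar x_i) - \bar y_i\| + \kappa_S\sqrt{\frac{2\ln(2|\mathcal{Y}|/\delta)}{2n}}\sum_{y \in \mathcal{Y}}\left(\sqrt{\hat p(y)} + \sqrt{p(y)}\right)\\
&\le L_S(w_S) + \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\| + \kappa_S\sqrt{\frac{2\ln(2|\mathcal{Y}|/\delta)}{2n}}\sum_{y \in \mathcal{Y}}\left(\sqrt{\hat p(y)} + \sqrt{p(y)}\right).
\end{align*}
We will now analyze the term $\frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\|$ on the right-hand side of equation 38. Since $W^* = W_{\bar S}$,
\[ \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| = \frac{1}{n}\sum_{i=1}^{n} \|g^*(x_i) - W_{\bar S} f_\theta(x_i)\|. \]
By using Hoeffding's inequality (Lemma 1), we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[ \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| \le \frac{1}{n}\sum_{i=1}^{n} \|g^*(x_i) - W_{\bar S} f_\theta(x_i)\| \le \mathbb{E}_{x^+}[\|g^*(x^+) - W_{\bar S} f_\theta(x^+)\|] + \kappa_{\bar S}\sqrt{\frac{\ln(1/\delta)}{2n}}. \]
Moreover, by using (Mohri et al., 2012, Theorem 3.1) with the loss function $x^+ \mapsto \|g^*(x^+) - W f(x^+)\|$ (i.e., Lemma 2), we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[ \mathbb{E}_{x^+}[\|g^*(x^+) - W_{\bar S} f_\theta(x^+)\|] \le \frac{1}{m}\sum_{i=1}^{m} \|g^*(x^+_i) - W_{\bar S} f_\theta(x^+_i)\| + \frac{2\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + \kappa\sqrt{\frac{\ln(1/\delta)}{2m}}, \]
where $\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F}) = \frac{1}{\sqrt{m}} \mathbb{E}_{\bar S, \xi}\left[\sup_{W \in \mathcal{W}, f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i \|g^*(x^+_i) - W f(x^+_i)\|\right]$ is the normalized Rademacher complexity of the set $\{x^+ \mapsto \|g^*(x^+) - W f(x^+)\| : W \in \mathcal{W}, f \in \mathcal{F}\}$ (it is normalized such that $\tilde{\mathcal{R}}_m(\mathcal{F}) = O(1)$ as $m \to \infty$ for typical choices of $\mathcal{F}$), and $\xi_1, \dots, \xi_m$ are independent uniform random variables taking values in $\{-1, 1\}$. Taking union bounds, we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\[ \frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| \le \frac{1}{m}\sum_{i=1}^{m} \|g^*(x^+_i) - W_{\bar S} f_\theta(x^+_i)\| + \frac{2\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + \kappa\sqrt{\frac{\ln(2/\delta)}{2m}} + \kappa_{\bar S}\sqrt{\frac{\ln(2/\delta)}{2n}}. \]
Similarly, for any $\delta > 0$, with probability at least $1 - \delta$,
\[ \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\| \le \frac{1}{m}\sum_{i=1}^{m} \|g^*(x^+_i) - W_{\bar S} f_\theta(x^+_i)\| + \frac{2\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + \kappa\sqrt{\frac{\ln(2/\delta)}{2m}} + \kappa_{\bar S}\sqrt{\frac{\ln(2/\delta)}{2n}}. \]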
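Thus, by taking union bounds, we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\| \tag{40}\\
&\le \frac{2}{m}\sum_{i=1}^{m} \|g^*(x^+_i) - W_{\bar S} f_\theta(x^+_i)\| + \frac{4\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + 2\kappa\sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bar S}\sqrt{\frac{\ln(4/\delta)}{2n}}.
\end{align*}
To analyze the first term on the right-hand side of equation 40, recall that
\[ W_{\bar S} = \operatorname{minimize}_{W'} \|W'\|_F \quad \text{s.t.} \quad W' \in \arg\min_{W} \frac{1}{m}\sum_{i=1}^{m} \|W f_\theta(x^+_i) - g^*(x^+_i)\|^2. \tag{41} \]
Here, since $W f_\theta(x^+_i) \in \mathbb{R}^r$, we have that
\[ W f_\theta(x^+_i) = \operatorname{vec}[W f_\theta(x^+_i)] = [f_\theta(x^+_i)^\top \otimes I_r]\operatorname{vec}[W] \in \mathbb{R}^r, \]
where $I_r \in \mathbb{R}^{r \times r}$ is the identity matrix, $[f_\theta(x^+_i)^\top \otimes I_r] \in \mathbb{R}^{r \times dr}$ is the Kronecker product of the two matrices, and $\operatorname{vec}[W] \in \mathbb{R}^{dr}$ is the vectorization of the matrix $W \in \mathbb{R}^{r \times d}$. Thus, by defining $A_i = [f_\theta(x^+_i)^\top \otimes I_r] \in \mathbb{R}^{r \times dr}$ and using the notation $w = \operatorname{vec}[W]$ and its inverse $W = \operatorname{vec}^{-1}[w]$ (i.e., the inverse of the vectorization from $\mathbb{R}^{r \times d}$ to $\mathbb{R}^{dr}$ with a fixed ordering), we can rewrite equation 41 as $W_{\bar S} = \operatorname{vec}^{-1}[w_{\bar S}]$, where
\[ w_{\bar S} = \operatorname{minimize}_{w'} \|w'\|_F \quad \text{s.t.} \quad w' \in \arg\min_{w} \sum_{i=1}^{m} \|g_i - A_i w\|^2, \]
with $g_i = g^*(x^+_i) \in \mathbb{R}^r$. Since the function $w \mapsto \sum_{i=1}^{m} \|g_i - A_i w\|^2$ is convex, a necessary and sufficient condition for a minimizer of this function is
\[ 0 = \nabla_w \sum_{i=1}^{m} \|g_i - A_i w\|^2 = 2\sum_{i=1}^{m} A_i^\top (g_i - A_i w) \in \mathbb{R}^{dr}. \]
This implies that $\sum_{i=1}^{m} A_i^\top A_i w = \sum_{i=1}^{m} A_i^\top g_i$. In other words, $A^\top A w = A^\top g$, where
\[ A = \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{bmatrix} \in \mathbb{R}^{mr \times dr} \quad \text{and} \quad g = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_m \end{bmatrix} \in \mathbb{R}^{mr}. \]
Thus, $w' \in \arg\min_{w} \sum_{i=1}^{m} \|g_i - A_i w\|^2 = \{(A^\top A)^\dagger A^\top g + v : v \in \operatorname{Null}(A)\}$, where $(A^\top A)^\dagger$ is the Moore-Penrose inverse of the matrix $A^\top A$ and $\operatorname{Null}(A)$ is the null space of the matrix $A$. Thus, the minimum norm solution is obtained by $\operatorname{vec}[W_{\bar S}] = w_{\bar S} = (A^\top A)^\dagger A^\top g$.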
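The vectorization identity and the minimum norm solution above can be sanity-checked numerically. The sketch below is illustrative only: the dimensions and random data are assumptions, the vectorization is taken to be column-stacking, and an explicit `rcond` is passed to `np.linalg.pinv` so that numerically zero singular values of the rank-deficient matrix $A^\top A$ are discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, m = 2, 5, 3                    # m < d: the least-squares problem is underdetermined
feats = rng.normal(size=(m, d))      # row i plays the role of f_theta(x_i^+)
targets = rng.normal(size=(m, r))    # row i plays the role of g^*(x_i^+)

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")

# Identity: W f = [f^T kron I_r] vec[W]
W = rng.normal(size=(r, d))
A_0 = np.kron(feats[0][None, :], np.eye(r))            # shape (r, d*r)
identity_gap = np.max(np.abs(A_0 @ vec(W) - W @ feats[0]))

# Stack A_i and g_i, then form the minimum norm solution (A^T A)^+ A^T g.
A = np.vstack([np.kron(f[None, :], np.eye(r)) for f in feats])   # (m*r, d*r)
g = targets.reshape(-1)                                           # [g_1; ...; g_m]
w_min = np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T @ g
W_min = w_min.reshape(r, d, order="F")                            # vec^{-1}

# Same solution written in matrix form: W = G F^+, with F = [f_1, ..., f_m].
W_direct = targets.T @ np.linalg.pinv(feats.T, rcond=1e-10)
print(identity_gap, np.max(np.abs(W_min - W_direct)))
```

Both printed gaps are at floating-point level, which matches the closed form $W_S = Y^\top Z_S^\top (Z_S Z_S^\top)^\dagger$ used later in the proof for the labeled data.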
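Thus, by using this $W_{\bar S}$, we have that
\begin{align*}
\frac{1}{m}\sum_{i=1}^{m} \|g^*(x^+_i) - W_{\bar S} f_\theta(x^+_i)\| &= \frac{1}{m}\sum_{i=1}^{m} \sqrt{\sum_{k=1}^{r} ((g_i - A_i w_{\bar S})_k)^2}\\
&\le \sqrt{\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{r} ((g_i - A_i w_{\bar S})_k)^2}\\
&= \frac{1}{\sqrt{m}}\sqrt{\sum_{i=1}^{m}\sum_{k=1}^{r} ((g_i - A_i w_{\bar S})_k)^2}\\
&= \frac{1}{\sqrt{m}}\|g - A w_{\bar S}\|_2\\
&= \frac{1}{\sqrt{m}}\|g - A(A^\top A)^\dagger A^\top g\|_2\\
&= \frac{1}{\sqrt{m}}\|(I - A(A^\top A)^\dagger A^\top) g\|_2,
\end{align*}
where the inequality follows from Jensen's inequality and the concavity of the square root function. Thus, we have that
\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n} \|\varphi(x_i)\| + \frac{1}{n}\sum_{i=1}^{n} \|\varphi(\bar x_i)\| \tag{42}\\
&\le \frac{2}{\sqrt{m}}\|(I - A(A^\top A)^\dagger A^\top) g\|_2 + \frac{4\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + 2\kappa\sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bar S}\sqrt{\frac{\ln(4/\delta)}{2n}}.
\end{align*}
By combining equation 38 and equation 42 with the union bound, we have that
\begin{align*}
&\mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] \tag{43}\\
&\le L_S(w_S) + \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| + \frac{2}{\sqrt{m}}\|P_A g\|_2 + \frac{4\tilde{\mathcal{R}}_m(\mathcal{W} \circ \mathcal{F})}{\sqrt{m}} + 2\kappa\sqrt{\frac{\ln(8/\delta)}{2m}} + 2\kappa_{\bar S}\sqrt{\frac{\ln(8/\delta)}{2n}}\\
&\quad + \kappa_S\sqrt{\frac{2\ln(4|\mathcal{Y}|/\delta)}{2n}}\sum_{y \in \mathcal{Y}}\left(\sqrt{\hat p(y)} + \sqrt{p(y)}\right),
\end{align*}
where $\tilde W = W_S - W^*$ and $P_A = I - A(A^\top A)^\dagger A^\top$. We will now analyze the second term on the right-hand side of equation 43:
\[ \frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| \le \|\tilde W\|_2\left(\frac{1}{n}\sum_{i=1}^{n} \|f_\theta(\bar x_i) - f_\theta(x_i)\|\right), \]
where $\|\tilde W\|_2$ is the spectral norm of $\tilde W$. Since $\bar x_i$ shares the same label with $x_i$ as $\bar x_i \sim \mathcal{D}_{y_i}$ (and $x_i \sim \mathcal{D}_{y_i}$), and because $f_\theta$ is trained with the unlabeled data $\bar S$, using Hoeffding's inequality (Lemma 1) implies that with probability at least $1 - \delta$,
\[ \frac{1}{n}\sum_{i=1}^{n} \|f_\theta(\bar x_i) - f_\theta(x_i)\| \le \mathbb{E}_{y \sim \rho}\mathbb{E}_{\bar x, x \sim \mathcal{D}_y^2}[\|f_\theta(\bar x) - f_\theta(x)\|] + \tau_{\bar S}\sqrt{\frac{\ln(1/\delta)}{2n}}. \]
Moreover, by using (Mohri et al., 2012, Theorem 3.1) with the loss function $(x, \bar x) \mapsto \|f_\theta(x) - f_\theta(\bar x)\|$ (i.e., Lemma 2), we have that with probability at least $1 - \delta$,
\[ \mathbb{E}_{y \sim \rho}\mathbb{E}_{\bar x, x \sim \mathcal{D}_y^2}[\|f_\theta(\bar x) - f_\theta(x)\|] \le \frac{1}{m}\sum_{i=1}^{m} \|f_\theta(x^+_i) - f_\theta(x^{++}_i)\| + \frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} + \tau\sqrt{\frac{\ln(1/\delta)}{2m}}, \]
where $\tilde{\mathcal{R}}_m(\mathcal{F}) = \frac{1}{\sqrt{m}} \mathbb{E}_{\bar S, \xi}\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \xi_i \|f(x^+_i) - f(x^{++}_i)\|\right]$ is the normalized Rademacher complexity of the set $\{(x^+, x^{++}) \mapsto \|f(x^+) - f(x^{++})\| : f \in \mathcal{F}\}$ (it is normalized such that $\tilde{\mathcal{R}}_m(\mathcal{F}) = O(1)$ as $m \to \infty$ for typical choices of $\mathcal{F}$), and $\xi_1, \dots, \xi_m$ are independent uniform random variables taking values in $\{-1, 1\}$. Thus, taking the union bound, we have that for any $\delta > 0$, with probability at least $1 - \delta$,
\begin{align*}
&\frac{1}{n}\sum_{i=1}^{n} \|\tilde W (f_\theta(\bar x_i) - f_\theta(x_i))\| \tag{47}\\
&\le \|\tilde W\|_2\left(\frac{1}{m}\sum_{i=1}^{m} \|f_\theta(x^+_i) - f_\theta(x^{++}_i)\| + \frac{2\tilde{\mathcal{R}}_m(\mathcal{F})}{\sqrt{m}} + \tau\sqrt{\frac{\ln(2/\delta)}{2m}} + \tau_{\bar S}\sqrt{\frac{\ln(2/\delta)}{2n}}\right).
\end{align*}
By combining equation 43 and equation 47 using the union bound, we have that with probability at least $1 - \delta$,
\[ \mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] \le \dots \tag{48} \]
For the empirical loss on the labeled data,
\begin{align*}
L_S(w_S) = \frac{1}{n}\sum_{i=1}^{n} \sqrt{\sum_{k=1}^{r} ((W_S f_\theta(x_i) - y_i)_k)^2} &\le \sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{r} ((W_S f_\theta(x_i) - y_i)_k)^2}\\
&= \frac{1}{\sqrt{n}}\|W_S Z_S - Y_S^\top\|_F\\
&= \frac{1}{\sqrt{n}}\|Y_S^\top (Z_S^\top (Z_S Z_S^\top)^\dagger Z_S - I)\|_F\\
&= \frac{1}{\sqrt{n}}\|(I - Z_S^\top (Z_S Z_S^\top)^\dagger Z_S) Y_S\|_F.
\end{align*}
Thus, $L_S(w_S) = \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F$, where $P_{Z_S} = I - Z_S^\top (Z_S Z_S^\top)^\dagger Z_S$. By combining equation 48 to equation 50 and using $1 \le \sqrt{2}$, we have that with probability at least $1 - \delta$,
\[ \mathbb{E}_{X,Y}[\|W_S f_\theta(X) - Y\|] \le c\,I_{\bar S}(f_\theta) + \frac{2}{\sqrt{m}}\|P_{Z_{\bar S}} Y_{\bar S}\|_F + \frac{1}{\sqrt{n}}\|P_{Z_S} Y_S\|_F + Q_{m,n}, \]
where $Q_{m,n}$ is the quantity defined in Theorem 4.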

H KNOWN LEMMAS

We use the following well-known theorems as lemmas in our proof. We state them below for completeness. These are classical results and not our results.

Lemma 1. (Hoeffding's inequality) Let $X_1, \dots, X_n$ be independent random variables such that $a \le X_i \le b$ almost surely. Consider the average of these random variables, $S_n = \frac{1}{n}(X_1 + \dots + X_n)$. Then, for all $t > 0$, $\mathbb{P}(\mathbb{E}[S_n] - S_n \ge t) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right)$ and $\mathbb{P}(S_n - \mathbb{E}[S_n] \ge t) \le \exp\left(-\frac{2nt^2}{(b-a)^2}\right)$.

It has been shown that generalization bounds can be obtained via Rademacher complexity (Bartlett & Mendelson, 2002; Mohri et al., 2012; Shalev-Shwartz & Ben-David, 2014). The following is a trivial modification of (Mohri et al., 2012, Theorem 3.1) for a one-sided bound on nonnegative general loss functions:

Lemma 2. Let $\mathcal{G}$ be a set of functions with the codomain $[0, M]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. draw of $m$ samples $S = (q_i)_{i=1}^{m}$, the following holds for all $\psi \in \mathcal{G}$:
\[ \mathbb{E}_q[\psi(q)] \le \frac{1}{m}\sum_{i=1}^{m} \psi(q_i) + 2\mathcal{R}_m(\mathcal{G}) + M\sqrt{\frac{\ln(1/\delta)}{2m}}, \]
where $\mathcal{R}_m(\mathcal{G}) := \mathbb{E}_{S,\xi}\left[\sup_{\psi \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \xi_i \psi(q_i)\right]$ and $\xi_1, \dots, \xi_m$ are independent uniform random variables taking values in $\{-1, 1\}$.

Proof. Let $S = (q_i)_{i=1}^{m}$ and $S' = (q'_i)_{i=1}^{m}$. Define
\[ \varphi(S) = \sup_{\psi \in \mathcal{G}} \mathbb{E}_{x,y}[\psi(q)] - \frac{1}{m}\sum_{i=1}^{m} \psi(q_i). \]
To apply McDiarmid's inequality to $\varphi(S)$, we compute an upper bound on $|\varphi(S) - \varphi(S')|$, where $S$ and $S'$ are two datasets differing by exactly one point of an arbitrary index $i_0$; i.e., $S_i = S'_i$ for all $i \ne i_0$ and $S_{i_0} \ne S'_{i_0}$. Then,
\[ \varphi(S') - \varphi(S) \le \sup_{\psi \in \mathcal{G}} \frac{\psi(q_{i_0}) - \psi(q'_{i_0})}{m} \le \frac{M}{m}. \]
In the bound on $\mathbb{E}_S[\varphi(S)]$, the first line follows from the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, and the third line follows from the fact that, for each $\xi_i \in \{-1, +1\}$, the distribution of each term $\xi_i(\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i))$ is the distribution of $(\ell(f(x'_i), y'_i) - \ell(f(x_i), y_i))$, since $S$ and $S'$ are drawn i.i.d. from the same distribution. The fourth line uses the subadditivity of the supremum.

I SIMCLR

In contrastive learning, different augmented views of the same image are attracted (positive pairs), while augmented views of different images are repelled (negative pairs). MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) are recent examples of self-supervised visual representation learning that reduce the gap between self-supervised and fully-supervised learning. SimCLR applies randomized augmentations to an image to create two different views, $x$ and $y$, and encodes both of them with a shared encoder, producing representations $r_x$ and $r_y$. Both $r_x$ and $r_y$ are l2-normalized. In the SimCLR version of the InfoNCE objective, $\eta$ is a temperature term and $K$ is the number of views in a minibatch.
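As an illustration of how such an objective is typically evaluated, the following is a minimal NumPy sketch of one common form of the InfoNCE loss, in which each view is scored against all $K$ candidates with temperature-scaled inner products of l2-normalized representations. The exact form used in the paper is not reproduced here; the function name `info_nce`, the default temperature, and the random inputs are assumptions for illustration.

```python
import numpy as np

def info_nce(r_x, r_y, eta=0.5):
    """Average InfoNCE loss over K paired views.

    Row k of r_x and row k of r_y are assumed to be two views of the same image;
    all other rows of r_y act as negatives for row k of r_x.
    """
    r_x = r_x / np.linalg.norm(r_x, axis=1, keepdims=True)   # l2-normalize
    r_y = r_y / np.linalg.norm(r_y, axis=1, keepdims=True)
    logits = r_x @ r_y.T / eta                                # (K, K) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
K, dim = 8, 16
r_x = rng.normal(size=(K, dim))
loss_aligned = info_nce(r_x, r_x)                        # both views map to the same point
loss_random = info_nce(r_x, rng.normal(size=(K, dim)))   # unrelated "views"
print(loss_aligned, loss_random)
```

With perfectly aligned views the loss cannot exceed $\ln K$, while unrelated representations give a larger value in this example, which reflects the attract-and-repel behavior described above.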



Figure 3: Evolution of the entropy for each of the learning rate configurations, showing that picking an incorrect learning rate for the data and/or centroids leads to a collapse of the samples.

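Define $Z_{\bar S} = [f(x^+_1), \dots, f(x^+_m)] \in \mathbb{R}^{d \times m}$. Then, we have $A = [Z_{\bar S}^\top \otimes I_r]$. Thus,
\[ P_A = I - [Z_{\bar S}^\top \otimes I_r][Z_{\bar S} Z_{\bar S}^\top \otimes I_r]^\dagger [Z_{\bar S} \otimes I_r] = I - [Z_{\bar S}^\top (Z_{\bar S} Z_{\bar S}^\top)^\dagger Z_{\bar S} \otimes I_r] = [P_{Z_{\bar S}} \otimes I_r], \]
where $P_{Z_{\bar S}} = I_m - Z_{\bar S}^\top (Z_{\bar S} Z_{\bar S}^\top)^\dagger Z_{\bar S} \in \mathbb{R}^{m \times m}$. By defining $Y_{\bar S} = [g^*(x^+_1), \dots, g^*(x^+_m)]^\top \in \mathbb{R}^{m \times r}$, since $g = \operatorname{vec}[Y_{\bar S}^\top]$,
\[ \|P_A g\|_2 = \|[P_{Z_{\bar S}} \otimes I_r]\operatorname{vec}[Y_{\bar S}^\top]\|_2 = \|\operatorname{vec}[Y_{\bar S}^\top P_{Z_{\bar S}}]\|_2 = \|P_{Z_{\bar S}} Y_{\bar S}\|_F. \tag{49} \]
On the other hand, recall that $W_S$ is the minimum norm solution
\[ W_S = \operatorname{minimize}_{W'} \|W'\|_F \quad \text{s.t.} \quad W' \in \arg\min_{W} \frac{1}{n}\sum_{i=1}^{n} \|W f_\theta(x_i) - y_i\|^2. \]
By solving this, we have $W_S = Y_S^\top Z_S^\top (Z_S Z_S^\top)^\dagger$, where $Z_S = [f(x_1), \dots, f(x_n)] \in \mathbb{R}^{d \times n}$ and $Y_S = [y_1, \dots, y_n]^\top \in \mathbb{R}^{n \times r}$.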
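The Kronecker structure of $P_A$ and the resulting identity $\|P_A g\|_2 = \|P_Z Y\|_F$ can be verified numerically. The following sketch uses small, arbitrary dimensions and random matrices as stand-ins for the representations and targets; these choices, and the explicit `rcond` passed to the pseudo-inverse, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, r = 3, 6, 2
Z = rng.normal(size=(d, m))    # columns play the role of f(x_i^+)
Y = rng.normal(size=(m, r))    # rows play the role of g^*(x_i^+)

A = np.kron(Z.T, np.eye(r))    # A = [Z^T kron I_r], shape (m*r, d*r)
P_A = np.eye(m * r) - A @ np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T
P_Z = np.eye(m) - Z.T @ np.linalg.pinv(Z @ Z.T, rcond=1e-10) @ Z

g = Y.reshape(-1)              # vec[Y^T]: the rows of Y stacked into one vector
lhs = np.linalg.norm(P_A @ g)
rhs = np.linalg.norm(P_Z @ Y, ord="fro")
print(lhs, rhs)
```

The two printed norms coincide, and `P_A` equals `np.kron(P_Z, np.eye(r))` up to floating-point error, so the $mr \times mr$ projection never needs to be formed explicitly in practice.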

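For any $\delta > 0$,
\[ \mathbb{P}_S\left(\mathbb{E}[S_n] - S_n \ge (b - a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right) \le \delta, \quad \text{and} \quad \mathbb{P}_S\left(S_n - \mathbb{E}[S_n] \ge (b - a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right) \le \delta. \]
By using Hoeffding's inequality, we have that for all $t > 0$,
\[ \mathbb{P}_S\left(\mathbb{E}[S_n] - S_n \ge t\right) \le \exp\left(-\frac{2nt^2}{(b - a)^2}\right), \quad \text{and} \quad \mathbb{P}_S\left(S_n - \mathbb{E}[S_n] \ge t\right) \le \exp\left(-\frac{2nt^2}{(b - a)^2}\right). \]
Setting $\delta = \exp\left(-\frac{2nt^2}{(b - a)^2}\right)$ and solving for $t > 0$,
\[ 1/\delta = \exp\left(\frac{2nt^2}{(b - a)^2}\right) \implies \ln(1/\delta) = \frac{2nt^2}{(b - a)^2} \implies \frac{(b - a)^2 \ln(1/\delta)}{2n} = t^2 \implies t = (b - a)\sqrt{\frac{\ln(1/\delta)}{2n}}. \]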
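The conversion between the tail form and the confidence form of Hoeffding's inequality can be written as two small helpers. This is an illustrative sketch; the function names and the example values of $n$ and $\delta$ are not from the paper.

```python
import math

def hoeffding_deviation(n, delta, a=0.0, b=1.0):
    """Deviation t = (b - a) * sqrt(ln(1/delta) / (2n)) for n independent variables in [a, b]."""
    return (b - a) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def hoeffding_tail(n, t, a=0.0, b=1.0):
    """One-sided Hoeffding tail bound exp(-2 n t^2 / (b - a)^2)."""
    return math.exp(-2.0 * n * t ** 2 / (b - a) ** 2)

n, delta = 1000, 0.05
t = hoeffding_deviation(n, delta)
print(t, hoeffding_tail(n, t))   # the tail bound evaluated at t recovers delta
```

Quadrupling the sample size halves the deviation, which is the $1/\sqrt{n}$ and $1/\sqrt{m}$ behavior that appears throughout the bound.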

Similarly, $\varphi(S) - \varphi(S') \le \frac{M}{m}$. Thus, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1 - \delta$,
\[ \varphi(S) \le \mathbb{E}_S[\varphi(S)] + M\sqrt{\frac{\ln(1/\delta)}{2m}}. \tag{55} \]

Entropy estimators on SSL: CIFAR10 accuracy under linear evaluation of SSL for different entropy estimators. The best results are achieved by the pairwise-distances lower bound.

