VARIANCE-COVARIANCE REGULARIZATION ENFORCES PAIRWISE INDEPENDENCE IN SELF-SUPERVISED REPRESENTATIONS

Abstract

Self-Supervised Learning (SSL) methods such as VICReg, Barlow Twins or W-MSE avoid collapse of their joint embedding architectures by constraining or regularizing the covariance matrix of their projector's output. This study highlights important properties of such a strategy, which we coin Variance-Covariance regularization (VCReg). More precisely, we show that VCReg combined with an MLP projector enforces pairwise independence between the features of the learned representation. This result emerges by bridging VCReg applied to the projector's output with kernel independence criteria applied to the projector's input. This provides the first theoretical motivation and explanation of MLP projectors in SSL. We empirically validate our findings, where (i) we identify which projector characteristics favor pairwise independence, (ii) we use these findings to obtain nontrivial performance gains for VICReg, and (iii) we demonstrate that the scope of VCReg goes beyond SSL by using it to solve Independent Component Analysis. We hope that our findings will support the adoption of VCReg in SSL and beyond.

1. INTRODUCTION

Self-Supervised Learning (SSL) via joint embedding architectures has risen as a way to learn visual representations that outperform their supervised counterparts. This paradigm enforces similar embeddings for two augmented versions of the same sample, thus allowing an encoder f to learn a representation for a given modality without labels. Importantly, f could solve the learning task by predicting the same embedding for every input, a failure mode known as collapse. To avoid this, various mechanisms have been proposed, hence the diversity of SSL methods (e.g., Grill et al. (2020); Caron et al. (2020; 2021); Chen & He (2021)). Most of these methods train a composition g ∘ f of the encoder with a projector neural network g. Only f is retained after training, as opposed to supervised training, which never introduces g. The projector was proposed by Chen et al. (2020) and significantly improved the quality of the learned representation in terms of test accuracy on ImageNet and other downstream tasks. Although some works (Appalaraju et al., 2020; Bordes et al., 2022) provide empirical knowledge about the projector, none provide a theoretical analysis of MLP projectors in practical SSL (Jing et al., 2021; Huang et al., 2021; Wang & Isola, 2020; HaoChen et al., 2021; Tian et al., 2020; Wang & Liu, 2021; Cosentino et al., 2022).

This study sheds new light on the role of the projector through the lens of Variance-Covariance Regularization (VCReg), a strategy introduced in recent SSL methods (Bardes et al., 2022; Zbontar et al., 2021; Ermolov et al., 2021) to cope with collapse by constraining or regularizing the covariance or cross-correlation matrix of the output of the projector g to be the identity. More precisely, we demonstrate that VC regularization of the projector's output enforces pairwise independence between the components of the projector's input, i.e., the encoder's output, and connects this property to projector characteristics such as width and depth. This provides the first theoretical motivation and explanation of MLP projectors in SSL. Fully or partially pairwise independent representations are generally sought after, e.g., to disentangle factors of variation (Li et al., 2019; Träuble et al., 2021). Our experimental analysis suggests that different levels of pairwise independence of the features in the representation emerge from a variety of SSL criteria, along with mutual independence. However, as opposed to other frameworks, VCReg allows for a theoretical study and an explicit control of the amount of independence learned.

We prove and experimentally validate this property for random projectors, study how it emerges in learned projectors, and use our results to obtain new and significant performance gains for VICReg over Bardes et al. (2022). We then ablate the SSL context and lean on our findings to show that VCReg of an SSL projector solves Independent Component Analysis (ICA). Beyond providing a novel theoretical understanding of the projector, we believe that this work also leads to a better understanding of VICReg. The scope of VCReg is not limited to SSL: our experiments on ICA open the way to other applications where some degree of independence is needed.

2. BACKGROUND

2.1. MEASURING PAIRWISE AND MUTUAL INDEPENDENCE

Measuring the independence between two sets of realizations $\{X_1^1, \dots, X_1^N\}$ and $\{X_2^1, \dots, X_2^N\}$, with $X_1^i \in \mathbb{R}^M$ and $X_2^i \in \mathbb{R}^M$, is a fundamental task with a long history in statistics, e.g., through the Mutual Information (MI) of the two random variables $X_1$ and $X_2$ from which those two sets are independently drawn (Cover, 1999). Those variables are said to be independent if the realization of one does not affect the probability distribution of the other. Computing the MI in practice is known to be challenging (Goebel et al., 2005), which has led to considerable interest in alternative criteria, e.g., based on functions in Reproducing Kernel Hilbert Spaces (RKHS) (Bach & Jordan, 2002), a special case of what is known as functional covariance or correlation (Rényi, 1959). It consists in computing those statistics after a nonlinear transformation (Leurgans et al., 1993), as in

$\sup_{f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2} \mathrm{Corr}\left(f_1(X_1), f_2(X_2)\right), \qquad (1)$

where $f_1, f_2$ are constrained to lie within some designed functional spaces, and Cov can be used instead of Corr. If Eq. (1) is small enough, then $X_1$ and $X_2$ are independent with regard to the functional spaces $\mathcal{F}_1, \mathcal{F}_2$. For example, if $\mathcal{F}_1$ and $\mathcal{F}_2$ are unit balls in their respective vector spaces, then Eq. (1) is just the norm of the usual correlation/covariance operator (Mourier, 1953), which would be enough for independence under joint Gaussian distributions (Melnick & Tenenbein, 1982).

HSIC and pairwise independence. More recently, Gretton et al. (2005a) introduced a pairwise independence criterion known as the Hilbert-Schmidt Independence Criterion (HSIC), which can be estimated given empirical samples $X_1, X_2 \in \mathbb{R}^{N \times M}$ as

$\widehat{\mathrm{HSIC}}(X_1, X_2) = \frac{1}{(N-1)^2} \mathrm{Tr}(K_1 H K_2 H), \qquad (2)$

with $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ the centering matrix, $(K_1)_{i,j} = k_1(X_1^i, X_1^j)$ and $(K_2)_{i,j} = k_2(X_2^i, X_2^j)$ the two kernel matrices of $X_1$ and $X_2$ respectively, and $k_1, k_2$ the universal kernels of $\mathcal{F}_1, \mathcal{F}_2$, such as the Gaussian kernel (see Steinwart (2001); Micchelli et al. (2006) for other examples). Crucially, since $\mathrm{HSIC}(X_1, X_2) \geq \sup_{f_1 \in \mathcal{F}_1, f_2 \in \mathcal{F}_2} \mathrm{Cov}\left(f_1(X_1), f_2(X_2)\right)$, HSIC can be used to test for (pairwise) independence, as formalized below.

Theorem 1 (Thm. 4 from Gretton et al. (2005a)). $\mathrm{HSIC}(X_1, X_2) = 0$ if and only if $X_1$ and $X_2$ are independent.

Gretton et al. (2005a) also provide a statistical test for pairwise independence based on HSIC. Further quantities such as upper bounds on the MI can be obtained in a similar way, e.g., see Thm. 16 in Gretton et al. (2005b). In our experiments, we rely on HSIC under the Gaussian kernel, with bandwidth scaled by the median of the distribution of pairwise Euclidean distances between samples.

dHSIC and mutual independence. Mutual independence of a set of D M-dimensional random variables $X_1, \dots, X_D$ is a stronger property than independence between all pairs of random variables in the set. To evaluate it, Pfister et al. (2018) introduce dHSIC, a multivariate extension of HSIC. In short, dHSIC measures the distance between the mean embeddings $\mu$ under an RKHS $\mathcal{F}$ (Smola et al., 2007) of the product of marginal distributions and of the joint distribution:

$\mathrm{dHSIC}(X_1, \dots, X_D) := \left\| \mu(P_{X_1} \otimes \cdots \otimes P_{X_D}) - \mu(P_{X_1, \dots, X_D}) \right\|_{\mathcal{F}}.$

Similarly to HSIC, Pfister et al. (2018) establish the equivalence between dHSIC = 0 and mutual independence, along with a statistical test. We provide an estimator of dHSIC given empirical samples $X_1, \dots, X_D \in \mathbb{R}^{N \times M}$, as well as implementations of HSIC and dHSIC, in Appendix D.
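For reference, the following is a minimal PyTorch sketch of the biased HSIC estimator of Eq. (2) with a Gaussian kernel and the median heuristic described above (function names are ours; the implementation provided in Appendix D may differ in details):

    import torch

    def gaussian_kernel(x, sigma=None):
        # x: (N, M) samples of a single variable; returns the (N, N) Gaussian kernel matrix.
        d2 = torch.cdist(x, x).pow(2)
        if sigma is None:
            # median heuristic: bandwidth set to the median pairwise Euclidean distance
            sigma = torch.sqrt(d2[d2 > 0]).median()
        return torch.exp(-d2 / (2 * sigma ** 2))

    def hsic(x1, x2, sigma1=None, sigma2=None):
        # Biased empirical estimator of Eq. (2): Tr(K1 H K2 H) / (N - 1)^2.
        n = x1.shape[0]
        h = torch.eye(n) - torch.ones(n, n) / n  # centering matrix H
        k1, k2 = gaussian_kernel(x1, sigma1), gaussian_kernel(x2, sigma2)
        return torch.trace(k1 @ h @ k2 @ h) / (n - 1) ** 2

As a sanity check, hsic(torch.randn(512, 1), torch.randn(512, 1)) is close to zero for independent samples, while applying it to a variable and a nonlinear function of that variable yields a noticeably larger value.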

Complexities. In what follows, we consider the D features of a batch of representations $X \in \mathbb{R}^{N \times D}$ as D scalar random variables (M = 1) with N samples each. The resulting complexities of HSIC for one pair of features and of dHSIC are $O(N^2)$ and $O(DN^2)$ respectively. Testing pairwise independence between all variables in a D-set with HSIC is $O(D^2 N^2)$, while dHSIC requires $N \geq 2D$ (Pfister et al., 2018). Since competitive visual representations typically have D = 2048, we resort to surrogates to estimate the independence of the representations in practice. These surrogates are detailed in Section 5.

2.2. VARIANCE-COVARIANCE REGULARIZATION IN SELF-SUPERVISED LEARNING

SSL with joint embeddings learns visual representations by producing two different augmented views of an input batch of images $S \in \mathbb{R}^{N \times W \times H}$, denoted by $S_{left}$ and $S_{right}$. Each view is fed to an encoder f, typically a ResNet50 (He et al., 2016), producing representations $X_{left}$ and $X_{right} \in \mathbb{R}^{N \times D}$, which are passed through a projector g to output embeddings $Z_{left}$ and $Z_{right} \in \mathbb{R}^{N \times P}$. Finally, an invariance term encouraging $Z_{left}$ and $Z_{right}$ to be similar is applied. While most SSL methods require architectural or training strategies to avoid collapse (Grill et al., 2020; He et al., 2021), Bardes et al. (2022) and Zbontar et al. (2021) only require modifying the loss. After training, only f is retained to be used in downstream tasks. We denote by $Z_{total} \triangleq [Z_{left}^T, Z_{right}^T]^T$ the $(2N, P)$ matrix stacking both views' embeddings.

VICReg. In Bardes et al. (2022), an anti-collapse term $\mathcal{L}_{VC}$, which we coin VC regularization (VCReg), is added to an invariance loss to form $\mathcal{L}_{VIC}$:

$\mathcal{L}_{VC} = \sum_{k=1}^{P} \left[ \max\left(0,\, 1 - \mathrm{Cov}(Z_{total})_{k,k}\right) + \alpha \sum_{j=1, j \neq k}^{P} \mathrm{Cov}(Z_{total})_{k,j}^2 \right],$

$\mathcal{L}_{VIC} = \frac{1}{N} \sum_{n=1}^{N} \left\| (Z_{left})_{n,\cdot} - (Z_{right})_{n,\cdot} \right\|_2^2 + \mathcal{L}_{VC}.$

The leftmost term in $\mathcal{L}_{VC}$ regularizes the variance of each feature in $Z_{total}$ to be at least one, while the second term minimizes the covariance between each pair of features in $Z_{total}$. Note that $\mathcal{L}_{VC}$ is applied to each view separately. In Bardes et al. (2022), the weight of each term in $\mathcal{L}_{VIC}$ can be tuned. Since the authors find that the best results are obtained with equal weights for the invariance and variance terms, we only vary $\alpha$. Note that Bardes et al. (2022); Zbontar et al. (2021) observe that wider projectors further improve the representation learned by the encoder f, yet neither further studies this intriguing phenomenon. Although this work focuses on VCReg as formulated in VICReg, we provide background on Barlow Twins and W-MSE in Appendix A, while the next section establishes the similarity of VICReg, Barlow Twins and W-MSE as VCReg optimizers.
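To make the losses above concrete, here is a minimal PyTorch-style sketch (our own variable and function names; the official VICReg implementation differs in details, e.g., a square root and an ε term inside the variance hinge and separate tunable weights for each term):

    import torch

    def vc_loss(z, alpha=1.0):
        # z: (N, P) embeddings of one view (or the stacked Z_total).
        zc = z - z.mean(dim=0)
        cov = zc.T @ zc / (z.shape[0] - 1)                    # (P, P) covariance matrix
        var_term = torch.relu(1.0 - torch.diag(cov)).sum()    # push each variance to at least one
        off_diag = cov - torch.diag(torch.diag(cov))
        cov_term = off_diag.pow(2).sum()                      # decorrelate all feature pairs
        return var_term + alpha * cov_term

    def vic_loss(z_left, z_right, alpha=1.0):
        invariance = (z_left - z_right).pow(2).sum(dim=1).mean()
        # as noted above, the VC term is applied to each view separately
        return invariance + vc_loss(z_left, alpha) + vc_loss(z_right, alpha)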

3. VC REGULARIZATION OF SSL PROJECTOR'S OUTPUT ENFORCES PAIRWISE INDEPENDENT FEATURES AT THE ENCODER'S OUTPUT

In this section, we demonstrate that for random Multi-Layer Perceptron (MLP) projectors, minimizing the VC of the projector's output amounts to minimizing HSIC -- a measure of pairwise dependence (see Section 2.1) -- between all pairs of features in the learned representation, i.e., the projector's input. We then justify how this reasoning extends to learned projectors. Our claims are experimentally verified in Section 5.

Notations. In our setting, $X \in \mathbb{R}^{N \times D}$ is the D-dimensional output of the encoder f for a batch of size N. X is then fed to the projector g to form embeddings $Z = g(X) \in \mathbb{R}^{N \times P}$ on which SSL criteria are generally applied. Note that Z would be $Z_{left}$ or $Z_{right}$ in the previous subsection. Unless stated otherwise, the acronym MLP refers to the projectors typically used in SSL: a neural network with three layers of the same width. Our proof strategy consists in proving that VCReg of each MLP layer (Linear + Batch Normalization (BN) + ReLU) output enforces pairwise independence of its input, before composing these results. We first study nonlinear elementwise projectors $g: \mathbb{R} \to \mathbb{R}^L$, which belong to the wider class of DeepSets (Zaheer et al., 2017), and of which BN followed by ReLU can be seen as an instance. We denote the mapping of such projectors as $Z = g(X) = [g(X_{:,1}), \dots, g(X_{:,D})]$.

Lemma 1 (Nonlinear elementwise projectors minimize HSIC of their input). Let $g: \mathbb{R} \to \mathbb{R}^L$ be a nonlinear elementwise projector; then, minimizing the covariance of Z with respect to the encoder f amounts to minimizing HSIC on all feature pairs in X with kernels $K_i = g(X_{:,i}) g(X_{:,i})^T$.

Proof. Let us consider the N values of the $i$-th data feature $X_{:,i}$ as realizations of a random variable. Recalling Eq. (2), the independence of two random variables can be estimated via HSIC. Considering the arbitrarily complicated network g and $Z = [g(X_{:,1}), \dots, g(X_{:,D})] \in \mathbb{R}^{N \times DL}$, we have

$\mathrm{HSIC}(X_{:,i}, X_{:,j}) = \frac{1}{(N-1)^2} \mathrm{Tr}\left( g(X_{:,i}) g(X_{:,i})^T H\, g(X_{:,j}) g(X_{:,j})^T H \right) = \frac{1}{(N-1)^2} \left\| g(X_{:,i})^T H g(X_{:,j}) \right\|_F^2 = \left\| \mathrm{Cov}\left( g(X_{:,i}), g(X_{:,j}) \right) \right\|_F^2 = \left\| \mathrm{Cov}(Z)_{1+iL:(i+1)L,\, 1+jL:(j+1)L} \right\|_F^2,$

which implies

$\sum_{i \neq j} \mathrm{HSIC}(X_{:,i}, X_{:,j}) = \left\| \mathrm{Cov}(Z) \odot \left( (1 - I_D) \otimes \mathbf{1}_L \mathbf{1}_L^T \right) \right\|_F^2,$

and, in the case L = 1, we have $\mathbf{1}_L \mathbf{1}_L^T = 1$, leading to $\sum_{i \neq j} \mathrm{HSIC}(X_{:,i}, X_{:,j}) = \sum_{i \neq j} \mathrm{Cov}(Z)_{i,j}^2$, concluding the proof.

To rigorously obtain independence, the $K_i$'s must be universal kernels. To satisfy this assumption, we may randomize Batch Normalization by drawing the mean and variance from some distribution. Combining this operation with a ReLU, we obtain random elementwise nonlinearities approaching random features (Rahimi & Recht, 2007) of a universal kernel (Sun et al., 2018). Increasing L improves the approximation of such a kernel (Chen & Phillips, 2017), i.e., the larger L, the better the covariance term in VICReg approximates HSIC; see Appendix D for the randomized Batch Normalization implementation, Figure 8 for its justification, and Section 5 for empirical validation.

Remark 2 (Necessity of variance regularization). Although the variance regularization term does not explicitly appear when minimizing HSIC on all pairs, it is necessary when optimizing X to prevent the degenerate solution of X being a constant, a common collapse mode of SSL.
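As a concrete illustration of the randomized Batch Normalization mentioned above, one could draw the centering and scaling statistics around the batch statistics; this is a sketch under our own assumption about the sampling distribution, the exact implementation being the one given in Appendix D:

    import torch

    class RandomizedBatchNorm1d(torch.nn.Module):
        # Batch Normalization whose centering and scaling statistics are drawn at random
        # around the batch statistics, so that the subsequent ReLU behaves as a random
        # elementwise nonlinearity (the noise distributions below are our assumption).
        def __init__(self, noise_std=0.5):
            super().__init__()
            self.noise_std = noise_std

        def forward(self, x):
            mu = x.mean(dim=0)
            sigma = x.std(dim=0) + 1e-5
            mu_r = mu + self.noise_std * sigma * torch.randn_like(mu)
            sigma_r = sigma * torch.exp(self.noise_std * torch.randn_like(sigma))
            return (x - mu_r) / sigma_r

    # random elementwise nonlinearity: randomized BN followed by a ReLU
    random_nonlinearity = torch.nn.Sequential(RandomizedBatchNorm1d(), torch.nn.ReLU())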
Lemma 2 (Random linear projectors minimize HSIC of their input). Let g be a random linear projector with weights W, and assume that all columns of X have the same variance. Then, for large projectors, minimizing the covariance of $Z = g(X) = XW$ with respect to the encoder f amounts to minimizing HSIC with a linear kernel for each pair of features in X. (Proof in Appendix B.1.)

Since the corresponding kernel in Lemma 2 is linear, only decorrelation can be achieved for such a projector's input. This however differs from PCA, as we optimize over X and not over g's parameters (W). Proving Lemma 2 requires a projector with orthogonal weights, i.e., $W^T W = I$, an assumption that becomes more and more accurate for random weights drawn from $\mathcal{U}(-1/\sqrt{D}, 1/\sqrt{D})$ as P (the output dimension of g) increases. This follows from the central limit theorem: the dot product between two P-dimensional weight vectors tends to 0 at rate $O(1/\sqrt{P})$. Weight initialization in neural networks roughly follows $\mathcal{U}(-1/\sqrt{D}, 1/\sqrt{D})$, which we use to implement random projectors.

Theorem 3 (MLP projectors with random weights enforce pairwise independence). Let us consider an MLP composed of alternating random linear layers and elementwise nonlinearities. Then, for large projectors, minimizing the variance and covariance of the output Z enforces pairwise independence between all pairs of features in the input X.

Proof. Let us consider the last block of such an MLP, i.e., a fully-connected linear layer followed by an elementwise nonlinearity. According to Lemma 1, applying VCReg to the MLP, i.e., at the output of the nonlinearity, enforces pairwise independence under the corresponding kernel for the input of the nonlinearity, which is also the output of the last linear layer. If the latter is wide enough to be considered orthogonal, Theorem 11 in Comon (1994) ensures that pairwise independence is preserved for the input of the layer. We can then recursively extend the result backward through the whole MLP. If the last MLP layer is a fully-connected linear layer, then Lemma 2 applies and we go back to the preceding elementwise nonlinearity.

Fig. 6 in Appendix C.1 shows that each layer in the random MLP is recursively VC-regularized although VCReg is applied only at the projector's output. Following Theorem 3, we can expect wider projectors to be beneficial for enforcing pairwise independence, while adding layers or learning the projector is not necessary; see Section 5 for empirical validation.

Extension to BarlowTwins and W-MSE, and generality of VCReg. Our results focus on VCReg as formulated in VICReg but can in fact be extended to methods that constrain the covariance of Z explicitly, namely BarlowTwins and W-MSE (Balestriero & LeCun, 2022). Indeed, the objective in W-MSE (Equations 3-4 in Ermolov et al. (2021)) is VICReg with an explicit constraint on the variance and covariance: increasing the variance and covariance hyper-parameters in VICReg recovers W-MSE, hence our results extend seamlessly. The objective from Eq. (7) in BarlowTwins is also similar to VICReg. The derivation, deferred to Appendix B.2, shows that minimizing the constrained form of the BarlowTwins objective from Eq. (7) is equivalent to minimizing VICReg's invariance term whilst explicitly constraining the variance and covariance terms as in W-MSE; hence our results also hold for BarlowTwins; see Section 5 for empirical validation. In particular, we will see that since BarlowTwins explicitly enforces minimum VCReg, it optimizes HSIC better than VICReg with standard hyper-parameters. Finally, as opposed to the BarlowTwins loss and most SSL methods, VCReg can be used, and be beneficial, within single-branch architectures.
We provide such a use case in Section 5.

Learned projectors. In state-of-the-art SSL representations, the projector is learned, which is not rigorously covered by Theorem 3. Complementing the study of Bordes et al. (2022), we argue that learning the projector is only crucial to satisfy the invariance criterion, since random projectors are sufficient to obtain pairwise independent features, as demonstrated in Section 5. In fact, our experiments show that (i) with VICReg, keeping the projector random generally produces lower HSIC of its input, i.e., better enforces pairwise independence, than learning the projector's weights, (ii) with VICReg, learning the projector to optimize VCReg more strongly than the invariance term reduces performance, and (iii) with VCReg alone, learning the projector, e.g., in the ICA setting studied later, creates a degenerate representation that does not enforce pairwise independence. We thus conjecture that learning the projector's parameters to mainly minimize VICReg's invariance term leaves the parameters close enough to their random initialization (Jacot et al., 2018) to maintain an accurate estimate of HSIC. In fact, we will see that the wider the projector, the less far the parameters have to move from initialization, and the better HSIC is optimized.

4. RELATED WORK

Feature decorrelation. Also known as whitening, feature decorrelation ensures that the correlation between each pair of different features in a batch of feature vectors is zero, and that each feature has unit variance. It was originally used as a data pre-processing technique, see e.g. Hyvärinen & Oja (2000), before being extended to deep networks as a regularizer (Cogswell et al., 2015). In the context of SSL, Hua et al. (2021); Ermolov et al. (2021) find that feature decorrelation helps solve the collapse issue. The former avoids collapse via appropriate Batch Normalization and its decorrelating variant (Huang et al., 2018) in the projector; this variant of Batch Normalization can be seen as the hard-constraint counterpart of BarlowTwins. Practically, whitening can be implemented by learning a fully-connected layer (Husain & Bober, 2019), recovering Principal Component Analysis; as mentioned in Section 3, this differs from VCReg, which keeps the layer's parameters random and optimizes its input.

Independence criteria for learning features. Enforcing mutual independence to learn a representation has been proposed, for example, by Schmidhuber (1992). More recently, and in the context of supervised learning, Chen et al. (2019) demonstrated improved training of ResNets by reducing the pairwise dependence of the features at each layer via a combination of Dropout and Batch Normalization. In contrast, VCReg is applied to the projector's output and common SSL projectors do not rely on Dropout; we also show that Batch Normalization is not necessary to reduce pairwise dependence.

HSIC in supervised and self-supervised learning. HSIC-based losses have been employed in supervised learning, e.g., see Mooij et al. (2009); Greenfeld & Shalit (2020), to ensure independence between the residual errors and the labels of a task at hand. Kornblith et al. (2019) measure similarity between two neural representations via HSIC. In SSL, Tsai et al. (2021) showed that a modified version of the BarlowTwins loss maximizes HSIC under a linear kernel between the embeddings of two augmented versions of the same sample. In a similar fashion, Li et al. (2021) proposed an SSL framework based on maximizing, via HSIC, the dependence between the embeddings of transformations of an image and the image identity. Both works differ from ours, which connects VCReg to the minimization of HSIC between pairs of features in the representation.

5. EXPERIMENTS

We first highlight the pairwise independence properties emerging in learned visual representations, before validating and exploiting our theoretical findings. Finally, we ablate the SSL context and demonstrate that the composition of an SSL-like random projector with VCReg induces enough independence to perform ICA. Importantly, none of the experiments with VCReg use Dropout, and Batch Normalization is not used when the projector is random. Hence, pairwise independence cannot be attributed to those two techniques, as opposed to Chen et al. (2019); Hua et al. (2021).

5.1. PAIRWISE INDEPENDENCE EMERGES IN MOST VISUAL REPRESENTATIONS

Setup. In these experiments, we track two metrics for the statistical independence of the components in the learned representation during ResNet50 training on ImageNet with popular SSL frameworks, as well as a supervised baseline. The first metric, HSIC (Gretton et al., 2005a), tracks pairwise independence. The second metric, dHSIC (Pfister et al., 2018), tracks mutual independence. As explained in Section 2.1, neither HSIC nor dHSIC scales to the full representation (2048 components). Therefore, we rely on proxies: for pairwise independence, we compute and average HSIC over all pairs of the first n components of the representation; for mutual independence, we sample sets of n components for which we display dHSIC. We set n = 10 and compute these statistics on the whole ImageNet validation set. We expect both metrics to decrease during training, which would indicate decreasing dependence among the features of the representation. Although statistical tests are available for both metrics, we track the bare HSIC and dHSIC values as they are continuous. We nevertheless perform HSIC tests in this first series of experiments so that the bare values can be linked to concrete proportions of independent pairs. The width of the Gaussian kernel in HSIC is scaled for each representation (see Section 2.1) to allow comparison between methods. See Appendix E for the detailed setup.
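For clarity, the pairwise proxy described above can be sketched as follows, reusing the hsic function sketched in Section 2.1 (names are ours):

    import itertools
    import torch

    def pairwise_hsic_proxy(representation, n=10):
        # representation: (N, D) encoder outputs on the validation set; average HSIC
        # over all pairs among the first n components (proxy for D = 2048 features).
        pairs = itertools.combinations(range(n), 2)
        values = [hsic(representation[:, [i]], representation[:, [j]]) for i, j in pairs]
        return torch.stack(values).mean()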

Results. Figure 1 shows that different learning frameworks implicitly optimize the pairwise independence of the features in the representation throughout training (left), and at some point mutual independence (right). As expected from Section 3, HSIC is continuously optimized by Barlow Twins and VICReg (it increases for DINO; HSIC is also continuously optimized by SimCLR, which could be explained in light of Garrido et al. (2022), who connect VICReg and SimCLR, and is omitted here for conciseness), and Barlow Twins better enforces pairwise independence since the covariance matrix is constrained. Precise values, along with HSIC independence tests at level α = 0.5 and corresponding test accuracies, can be found in Table 1. Lower HSIC values do not necessarily entail more successful tests: it is possible to allocate low HSIC to sufficiently many pairs while having an overall larger HSIC. For example, Supervised has fewer independent pairs in spite of an overall smaller HSIC. Table 1 also suggests that pairwise independence is not predictive of test accuracy. Although it seems to be a desirable property, it is not sufficient to yield a good representation: for example, DINO achieves better test accuracy than BarlowTwins while having higher HSIC (as opposed to the other methods here, DINO relies on multicrop, which partly explains the performance gap). We study this trade-off in the next subsection. Although we observe decreasing dHSIC, i.e., improving mutual independence in the representations, all methods quickly reach a floor and do not improve further, although the floor is lower for Barlow Twins and VICReg. Hence, we cannot conclude that this property is properly optimized; we do not study it in Subsection 5.2, but conduct experiments to better evaluate the mutual independence obtained by VCReg in Subsection 5.3.

5.2. PROJECTOR CHARACTERISTICS FOSTERING OR HURTING PAIRWISE INDEPENDENCE

Setup. These experiments validate the results from Section 3 by studying which projector characteristics help or hurt HSIC optimization when learning visual representations. The setup is the same as above, except that dHSIC is no longer used and the projector may be replaced by the projectors introduced in Section 3: an elementwise nonlinear projector from Lemma 1, random linear projectors of various widths from Lemma 2, and an MLP taking the representation (D = 2048) as input and outputting an embedding of size P, with two hidden layers of width P and ReLUs. The MLP can be random as in Theorem 3 or learned as in Section 5.1. We apply an invariance loss along with VCReg on the output of the projectors and scale the covariance coefficient with the size of the projector for fair comparison. See Appendix E for the detailed setup.

Results. Figure 2 (top left) demonstrates that nonlinear elementwise projectors, random linear projectors and random MLPs with one layer achieve lower HSIC than learned MLPs, in particular when resampled so as to get better HSIC estimates. Hence, learning the projector is not necessary to obtain pairwise independence. As expected from Section 3, for both theoretical and classical projectors, increasing width yields lower HSIC while increasing depth hurts HSIC, see Figure 2. Table 5 in Appendix C.1 shows that increasing the weight of VCReg does not improve HSIC: a possible explanation is that the projector would otherwise start to optimize too much for VCReg. This intuition is backed by Experiments 5.3. Modifying Batch Normalization to get closer to a universal kernel, as in Lemma 1, yields significant test accuracy gains for VICReg over Bardes et al. (2022) (Figure 2, top right). Generally, low HSIC is beneficial provided that the test accuracy on ImageNet remains reasonable: since HSIC does not guarantee that information is retained from the data, optimizing for it alone results in a performance loss, see the inflexions in Figure 2 (top right) and Figure 5. We conjecture that HSIC combined with, e.g., test accuracy on ImageNet could be used for model selection.

Discussion. As expected from Section 3, Batch Normalization in the projector is not required to enforce pairwise independence: Figure 2 (top right) shows that removing BN improves HSIC while maintaining test accuracy, which could be explained by the fact that each activation in the projector is implicitly regularized by VCReg (Figure 6), alleviating the need for BN. This could also explain why adding layers is detrimental to HSIC: the implicit VCReg of the activations, an assumption of Theorem 3, will be more and more loosely enforced. Although the projectors from Section 3 learn representations with better pairwise independence properties than classical SSL methods, most of them are not as competitive in terms of test accuracy (Fig. 2, top right). This complements the view of Appalaraju et al. (2020); Bordes et al. (2022), who argue that the projector requires some learning capacity to filter out information that is irrelevant to the invariance criterion. Our experiments suggest that the projector capacity should rather be increased via width than via depth: adding layers can be detrimental to HSIC, as seen above, but also to test accuracy (Appalaraju et al., 2020; Chen et al., 2021).

5.3. ABLATION: INDEPENDENT COMPONENT ANALYSIS (ICA) WITH VCREG

The following experiments aim to demonstrate that our findings hold outside of SSL. To this end, we show that VCReg of an SSL-like projector's output induces enough pairwise independence to solve linear ICA.

VC regularized projectors solve linear ICA. In the linear ICA setting (see Appendix A.2, where mixtures Y = SA of independent sources S are unmixed by a matrix M), finding an M enforcing pairwise independence between the components of YM is generally sufficient to recover S (Comon (1994), Theorem 11). VCReg of a random projector's output should therefore be able to recover S. Our model can be seen in Figure 3: whitened batches of mixtures Y are fed to an encoder f. Here, f is the linear transformation M described above. The output X = f(Y) is then fed to a projector g, and VCReg is applied to the covariance matrix of the output Z = g(X). As in Experiment 5.2, the projector is randomly resampled at each gradient step to get better HSIC estimates. This setting corresponds to one branch of a VICReg network (i) with the linear map M as encoder instead of a neural network, (ii) with a random projector, and (iii) without the invariance criterion. We perform ICA on two datasets: a synthetic one (Brakel & Bengio, 2017) with 6 sources among which 2 are noise, and a real audio dataset (Kabal, 2002) with 3 sources among which one is noise, also used in Brakel & Bengio (2017). The evaluation metric is the maximum correlation between the true and reconstructed sources. As this metric is not available without ground truth, we select the model with the lowest dHSIC of X, which can be evaluated exactly here since the number of sources is small. Our model recovers the sources and is even competitive with methods specifically designed for linear ICA, such as FastICA (Hyvärinen & Oja, 2000) or Anica (Brakel & Bengio, 2017), see Table 2. Both increasing the width of the projector and resampling it at each step (especially for smaller projectors) improve the reconstruction, as can be seen in Figure 2 (bottom right). This is in line with our findings in Section 3 and Experiment 5.2.

VC regularized projectors do not solve nonlinear ICA. Experiment 5.1 suggests that mutual independence also improves during training, although not as clearly as pairwise independence. Hence, one could ask whether VCReg also enforces the former enough to solve nonlinear ICA. To test this hypothesis, we apply VCReg to a particular case of nonlinear ICA which allows identifiability but for which pairwise and mutual independence are not equivalent: the post-nonlinear mixture (PNL) (Taleb & Jutten, 1999). In PNL, the sources are linearly mixed before being fed to elementwise nonlinear functions. For these experiments, and following Brakel & Bengio (2017), our encoder is an MLP. During our first experiments, we observed an informational collapse of the encoder, which produces seemingly mutually independent variables with very poor reconstruction of the sources. To alleviate this issue, we add a decoder taking X as input and reconstructing Y; Figure 9 in Appendix E shows our modified setup. We compare VCReg to FastICA and Anica. Although FastICA is not meant to solve the nonlinear case, it remains an interesting baseline. Table 2 shows that our model fails to recover the sources, doing only slightly better than FastICA in the synthetic case. Hence, although mutual independence increases during training, VCReg does not optimize it enough to solve nonlinear ICA. We propose a simple explanation for this limitation: each feature in Z is a nonlinear function of all features in X.
Hence, it is not possible to completely decorrelate two components of Z, as they both depend on the same set of features. It is still possible to improve mutual independence to some extent since, in practice, only parts of the inputs are considered at once by nonlinear mappings such as neural networks (Erhan et al., 2009; Adebayo et al., 2018).

Learning the projector does not enforce pairwise independence. ICA experiments done with a learned projector fail to recover independent sources: this further demonstrates that learning the projector to specifically optimize VCReg is counter-productive, and that in the context of SSL, learning the projector is rather useful for satisfying the invariance criterion, which is absent in the ICA experiments. VCReg is instead optimized by the encoder f.
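As an illustration of the linear ICA setup above, the following sketch shows one gradient step: one branch, the linear unmixing matrix M as encoder, a frozen random projector resampled at each step, and VCReg only (reusing the vc_loss sketch of Section 2.2; names and hyper-parameters are ours and the actual setup of Appendix E.2 differs):

    import torch

    def random_projector(d_in, width):
        # SSL-like random projector: a random linear layer followed by an elementwise
        # nonlinearity; its weights stay frozen (only the unmixing matrix M is optimized).
        proj = torch.nn.Sequential(torch.nn.Linear(d_in, width), torch.nn.ReLU())
        for p in proj.parameters():
            p.requires_grad_(False)
        return proj

    def ica_step(y_batch, M, width=1024, alpha=1.0):
        # y_batch: (N, D) whitened mixtures; M: (D, D) learnable unmixing matrix.
        x = y_batch @ M                          # candidate sources X = f(Y)
        g = random_projector(M.shape[1], width)  # resampled at every gradient step
        z = g(x)
        return vc_loss(z, alpha)                 # VCReg only, no invariance term

    D = 6  # as in the synthetic dataset
    M = torch.nn.Parameter(torch.eye(D))
    optimizer = torch.optim.SGD([M], lr=1.0)
    # training loop (sketch):
    # for y_batch in loader:
    #     optimizer.zero_grad(); ica_step(y_batch, M).backward(); optimizer.step()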

6. CONCLUSION: WHAT IS THE INTEREST OF PAIRWISE INDEPENDENCE?

This work claims that SSL projectors enforce pairwise independence. But is it a desirable property in a representation? For example, Wang & Jordan (2021) claim that mutual independence can be detrimental to learning disentangled representations. Pairwise independence may nevertheless be beneficial for representations meant to be linearly probed: Table 3 suggests that, given two representations with similar test accuracy on ImageNet, the one with significantly lower HSIC performs better on some downstream tasks under linear evaluation, a common requirement for SSL representations. Showing that the projector is responsible for this, and demonstrating that HSIC is useful for model selection, are important results we leave for future work. Finally, this work focuses on random weights, yet it opens the way to weight distributions that are closer to trained parameters in practice, which, for example, have been characterized for overparameterized networks (Jacot et al., 2018).

Implementations of the algorithms used (HSIC, dHSIC, Randomized Batch Normalization, Linear ICA, and PNL ICA) are provided in Appendix D, and our code will be released if the paper is accepted. Our experimental setup is detailed either in the main paper or in Appendix E; the latter includes architectural details, augmentations, the optimizer with learning rate and batch size, and the grid search for hyper-parameters. For theoretical results, explanations of the assumptions are provided in Section 3 and proofs are displayed either in the main paper or in Appendix B.1; the results are verified experimentally in Section 5. The numbers used to produce the figures of Section 5 are provided in Appendix C.1.

A FURTHER BACKGROUND

A.1 VCREG METHODS

Learning visual representations has a long history in machine learning. Shortly after the advent of convolutional neural networks (CNNs), it was common to train a model on a supervised task such as ImageNet, remove the classifier, and use the remaining model to produce features for downstream tasks (e.g., classification, segmentation), a technique coined transfer learning. Then, strategies emerged to learn representations without labelled datasets by enforcing invariant embeddings (Misra & Maaten, 2020; Chen et al., 2020). Joint embedding methods can be divided into two categories. Contrastive learning (e.g., SimCLR (Chen et al., 2020)) pulls together the representations of two augmented versions of the same image while pushing the representation of this image away from the representations of different images. Non-contrastive learning (e.g., DINO (Caron et al., 2021)) pulls together the representations of two augmented versions of the same image while avoiding collapse using various techniques. The frameworks considered in this work belong to the latter category. Non-contrastive self-supervised representation learning of images has a few peculiarities:

• Data augmentations are central, at least when learning from ImageNet, and must be carefully chosen.
• Most methods can be used either with Vision Transformers (Dosovitskiy et al., 2020) or with CNNs such as ResNets (He et al., 2016). In this work, we focus on ResNets.
• Model selection is usually performed via the Top-1 accuracy on the ImageNet validation set using a linear classifier on top of the learned representation. Outside of joint embedding, methods such as Masked Auto-Encoders (He et al., 2021) perform poorly under linear evaluation while delivering excellent performance when fine-tuned on the downstream task.

The crucial role of the projector. It was found by Chen et al. (2020) that adding an MLP on top of the encoder (removed after training) significantly improved the quality of the learned representation. For example, in 100 epochs, VICReg without a projector achieves 48% validation accuracy on ImageNet instead of 68%, and SimCLR 50% instead of 68%. Since then, a few theoretical works have attempted to explain SSL with joint embeddings. Jing et al. (2021) study the role of linear projectors with restricted augmentations, while other works (Huang et al., 2021; Wang & Isola, 2020; HaoChen et al., 2021; Tian et al., 2020; Wang & Liu, 2021) only consider an encoder without projector in their theoretical analysis. Finally, Cosentino et al. (2022) consider an MLP projector in a very restricted context where data augmentations are Lie group transformations.

BarlowTwins. Zbontar et al. (2021) propose a slightly different approach based on regularizing C, the $P \times P$ cross-correlation matrix between $Z_{left}$ and $Z_{right}$, by optimizing

$\mathcal{L}_{BT} = \sum_{k=1}^{K} \left( (C)_{k,k} - 1 \right)^2 + \alpha \sum_{k' \neq k} (C)_{k,k'}^2.$

Here, the leftmost term regularizes the cross-correlation of the same feature in the two views to be one, while the rightmost term regularizes the cross-correlation between pairs of different features in the two views. Importantly, $(C)_{i,j}$ amounts to the cosine similarity between the $i$-th column of $Z_{left}$ and the $j$-th column of $Z_{right}$, i.e.,

$(C)_{i,j} = \frac{\langle (Z_{left})_{:,i}, (Z_{right})_{:,j} \rangle}{\| (Z_{left})_{:,i} \|_2 \| (Z_{right})_{:,j} \|_2}.$

Hence, the leftmost term is also an invariance term: all features must be similar for both views.
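A minimal sketch of this objective, using the cosine-similarity form of C written above (our own code; the released BarlowTwins implementation differs slightly, e.g., it normalizes each feature across the batch), follows; the default α = 0.0051 is the off-diagonal coefficient reported in Appendix E.1:

    import torch

    def barlow_twins_loss(z_left, z_right, alpha=0.0051):
        # Cross-correlation matrix C with (C)_{i,j} the cosine similarity between
        # the i-th column of Z_left and the j-th column of Z_right.
        zl = z_left / z_left.norm(dim=0, keepdim=True)
        zr = z_right / z_right.norm(dim=0, keepdim=True)
        c = zl.T @ zr
        on_diag = (torch.diag(c) - 1).pow(2).sum()               # invariance-like term
        off_diag = (c - torch.diag(torch.diag(c))).pow(2).sum()  # redundancy reduction
        return on_diag + alpha * off_diag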

W-MSE. Ermolov et al. (2021) use the following loss

$\min\; 2 - 2\, \frac{\langle Z_{left}, Z_{right} \rangle}{\| Z_{left} \|_2 \| Z_{right} \|_2} \quad \text{s.t.} \quad \mathrm{Cov}(Z_{left}) = I, \ \mathrm{Cov}(Z_{right}) = I,$

where the cosine similarity can be replaced by the Euclidean distance.
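The whitening constraint can be instantiated, for instance, by whitening each view's embeddings before the similarity loss. The following is a simplified sketch of such a whitened MSE under this assumption; the actual W-MSE implementation of Ermolov et al. (2021) differs in how and where the whitening is performed:

    import torch

    def whiten(z, eps=1e-4):
        # Map z to embeddings whose empirical covariance is the identity
        # (Cholesky whitening of the batch).
        zc = z - z.mean(dim=0)
        cov = zc.T @ zc / (z.shape[0] - 1) + eps * torch.eye(z.shape[1])
        L = torch.linalg.cholesky(cov)
        return torch.linalg.solve_triangular(L, zc.T, upper=False).T

    def wmse_loss(z_left, z_right):
        wl, wr = whiten(z_left), whiten(z_right)
        cos = torch.nn.functional.cosine_similarity(wl, wr, dim=1).mean()
        return 2 - 2 * cos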

A.2 INDEPENDENT COMPONENT ANALYSIS

The goal of ICA (Comon, 1994) is to find a transformation of a random vector Y (for simplicity, Y directly denotes N empirical D-dimensional observations) which minimizes the statistical dependence of its components. Its simplest instance is linear ICA, motivated by problems such as the cocktail party, where $Y \in \mathbb{R}^{N \times D}$ typically results from a linear transformation of independent sources S (e.g., overlapping voices) that one wants to recover. Formally, $Y = SA$, with $A \in \mathbb{R}^{D \times D}$ an unknown mixing matrix that can therefore not be inverted directly. Instead, the linear ICA approach searches for $M \in \mathbb{R}^{D \times D}$ such that $YM = S$ by maximizing the statistical mutual independence between the components of YM. Being able to recover S is a property known as identifiability.

B PROOFS

B.1 PROOF OF LEMMA 2

Proof. Let us consider the linear regime, for which $g(X_{:,i}) = X_{:,i} w_i^T$ with orthogonal weights, i.e., $\langle w_i, w_j \rangle = 1_{\{i=j\}}$. This assumption is realistic since we assume that $w_i \in \mathbb{R}^K$ with K very large, and that the weights are randomly sampled. From that, we then have

$\sum_{i \neq j} \mathrm{HSIC}(X_{:,i}, X_{:,j}) = \frac{1}{(N-1)^2} \sum_{i \neq j} \mathrm{Tr}\left( g(X_{:,i}) g(X_{:,i})^T H\, g(X_{:,j}) g(X_{:,j})^T H \right) = \frac{1}{(N-1)^2} \sum_{i \neq j} \left\| g(X_{:,i})^T H g(X_{:,j}) \right\|_F^2.$

We now push the sum inside the norm by considering the following equality:

$\Big\| \sum_{i \neq j} f(i,j) \Big\|_F^2 = \sum_{i \neq j} \| f(i,j) \|_F^2 + \sum_{i \neq j} \sum_{k \neq \ell,\, (i,j) \neq (k,\ell)} \mathrm{Tr}\left( f(i,j)^T f(k,\ell) \right) = \sum_{i \neq j} \| f(i,j) \|_F^2,$

since in our case

$f(i,j) = g(X_{:,i})^T H g(X_{:,j}) = w_i X_{:,i}^T H X_{:,j} w_j^T,$
$f(i,j)^T f(k,\ell) = w_j X_{:,j}^T H X_{:,i} w_i^T w_k X_{:,k}^T H X_{:,\ell} w_\ell^T,$
$\mathrm{Tr}\left( f(i,j)^T f(k,\ell) \right) = 1_{\{i=k \wedge j=\ell\}}\, X_{:,j}^T H X_{:,i}\, X_{:,k}^T H X_{:,\ell},$

leading to

$\sum_{i \neq j} \mathrm{HSIC}(X_{:,i}, X_{:,j}) = \frac{1}{(N-1)^2} \Big\| \sum_{i \neq j} g(X_{:,i})^T H g(X_{:,j}) \Big\|_F^2 = \frac{1}{(N-1)^2} \Big\| \Big( \sum_i g(X_{:,i}) \Big)^T H \Big( \sum_j g(X_{:,j}) \Big) - \sum_i g(X_{:,i})^T H g(X_{:,i}) \Big\|_F^2 = \Big\| \frac{1}{N-1} \Big( \sum_i g(X_{:,i}) \Big)^T H \Big( \sum_j g(X_{:,j}) \Big) - \mathrm{diag}(\mathrm{Cov}(XW)) \Big\|_F^2 = \left\| \mathrm{Cov}(XW) - \mathrm{diag}(\mathrm{Cov}(XW)) \right\|_F^2 = \sum_{i \neq j} \mathrm{Cov}(XW)_{i,j}^2.$

For the next-to-last equality, since all columns of X are assumed to have the same variance and W is orthogonal, we have $\frac{1}{N-1} \sum_i g(X_{:,i})^T H g(X_{:,i}) = \mathrm{Var}(X_{:,1}) \sum_i w_i w_i^T = \mathrm{Var}(X_{:,1})$, and for the last equality we use the fact that, since g is linear, $\sum_i g(X_{:,i}) = \sum_i X_{:,i} w_i^T = XW$ with $W = [w_1, \dots, w_K]^T$.

B.2 ON THE EQUIVALENCE BETWEEN VICREG AND BARLOWTWINS OBJECTIVES

We can express the Barlow Twins objective as

$\min \sum_{k=1}^{K} \left( \mathrm{Cov}(Z_{left}, Z_{right})_{k,k} - 1 \right)^2 + \alpha \sum_{k' \neq k} \mathrm{Cov}(Z_{left}, Z_{right})_{k,k'}^2 \quad \text{s.t.} \quad \mathrm{Cov}(Z_{left}) = I, \ \mathrm{Cov}(Z_{right}) = I.$

Assuming $Z_{left}^T Z_{left} = I$ and $Z_{right}^T Z_{right} = I$, i.e., perfect minimization of the variance and covariance terms, we have

$C_{i,j} = \frac{\langle (Z_{left})_{:,i}, (Z_{right})_{:,j} \rangle}{\|(Z_{left})_{:,i}\|_2 \|(Z_{right})_{:,j}\|_2} = \langle (Z_{left})_{:,i}, (Z_{right})_{:,j} \rangle = 1 - \frac{1}{2} \|(Z_{left})_{:,i} - (Z_{right})_{:,j}\|_2^2,$

and thus

$\sum_i (C_{i,i} - 1)^2 = \frac{1}{4} \sum_i \|(Z_{left})_{:,i} - (Z_{right})_{:,i}\|_2^4 \propto \|Z_{left} - Z_{right}\|_F^4 \propto \mathcal{I}(Z_{left}, Z_{right}),$

so we recover the invariance loss exactly from the diagonal terms of BarlowTwins. We now show that minimizing this quantity is actually enough to also minimize the off-diagonal terms:

$\sum_{i \neq j} C_{i,j}^2 = \sum_{i \neq j} \left( 1 - \tfrac{1}{2} \|(Z_{left})_{:,i} - (Z_{right})_{:,j}\|_2^2 \right)^2$
$= \sum_{i \neq j} \left( 1 - \tfrac{1}{2} \|(Z_{left})_{:,i} - (Z_{left})_{:,j} + (Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2^2 \right)^2$
$= \sum_{i \neq j} \left( 1 - \tfrac{1}{2} \|(Z_{left})_{:,i} - (Z_{left})_{:,j}\|_2^2 - \tfrac{1}{2} \|(Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2^2 - \langle (Z_{left})_{:,i} - (Z_{left})_{:,j},\, (Z_{left})_{:,j} - (Z_{right})_{:,j} \rangle \right)^2$
$= \sum_{i \neq j} \left( - \tfrac{1}{2} \|(Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2^2 - \langle (Z_{left})_{:,i} - (Z_{left})_{:,j},\, (Z_{left})_{:,j} - (Z_{right})_{:,j} \rangle \right)^2$
$= \sum_{i \neq j} \left( \tfrac{1}{2} \|(Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2^2 + \langle (Z_{left})_{:,i} - (Z_{left})_{:,j},\, (Z_{left})_{:,j} - (Z_{right})_{:,j} \rangle \right)^2$
$\leq \sum_{i \neq j} \left( \tfrac{1}{2} \|(Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2^2 + \|(Z_{left})_{:,i} - (Z_{left})_{:,j}\|_2 \|(Z_{left})_{:,j} - (Z_{right})_{:,j}\|_2 \right)^2$
$\leq \frac{25(K-1)}{4} \|Z_{left} - Z_{right}\|_F^4,$

hence minimizing the BarlowTwins loss with the explicit whitening constraint is equivalent to minimizing VICReg with the explicit whitening constraint.

Figure 5: Minimizing HSIC is beneficial only up to some extent, since HSIC does not guarantee that useful information is present in the representation. Beyond this limit, the performance decreases on ImageNet and on downstream tasks.

E EXPERIMENTAL DETAILS

E.1 DETAILS ON IMAGENET EXPERIMENTS

Architectures and hyper-parameters. We use a ResNet50 backbone, optimized for 100 epochs with LARS and an initial learning rate of 0.3 for all methods. The batch size is 1024 for all methods except SimCLR. All methods use an 8192-8192-8192 projector except DINO and Supervised. Throughout training, we evaluate the learned representation using an online linear classifier. Details specific to each method:

• Barlow Twins: for Experiment 5.1, the off-diagonal coefficient is 0.0051.
• VICReg: for Experiment 5.1, the projector size is 8192-8192-8192, the invariance coefficient is 25, the variance coefficient is 25, and the covariance coefficient is 1. For the rest of the experiments, and following Bardes et al. (2022), we only change the covariance coefficient when changing the size of the projector: we scale it with the square root of the output size of the projector so that the covariance term has a similar magnitude for all projector sizes.
• SimCLR: for Experiment 5.1, we choose a larger batch size of 2048, as SimCLR is sensitive to this parameter (Chen et al., 2020). The temperature is 0.15.
• DINO: we use 8 crops. The head has 4 layers of widths 2048-2048-2048-256-65536.

Augmentations. At train time, the resolution is 160 and we use:

• RandomHorizontalFlip().
• ColorJitter(0.8, 0.4, 0.4, 0.2, 0.1).
• Greyscale(0.2).
• NormalizeImage with ImageNet mean and standard deviation.
• GaussianBlur() with kernel size (5, 9) and sigma (0.1, 2).

At validation time, the resolution is 224 and the images are cropped and normalized with ImageNet mean and standard deviation.

E.2 ICA SETUP

Detailed setup. In these experiments, we keep the same optimizer, train for 100 epochs, and use a batch size of 64 for both datasets. We tune the learning rate according to a logarithmic grid [1.0, 10.0, 100.0] (then rescaled according to lr × batch size / 256). The MLP can either be an SSL-like projector or simply a fully-connected layer followed by a ReLU. The wider the MLP, the better the result, hence its width does not require selection. For the synthetic dataset, the MLP has width 1024, and 8192 for the audio one.

• Linear ICA: we tune the standard and covariance coefficients according to a logarithmic grid [1, 10, 100].
• PNL ICA: our architecture for the nonlinear ICA experiments is presented in Figure 9. The encoder f is an MLP with 3 layers and width 128. The decoder h is a learnable nonlinearity (an MLP with 3 hidden layers and depth 16) followed by a fully-connected layer. We tune the standard, covariance, and reconstruction coefficients according to a larger logarithmic grid [1, 3, 10, 30, 100].






Figure 1: Overall, HSIC and dHSIC of the encoder output improve for all methods during training. VCReg-based methods show smoother and/or further optimization of HSIC and dHSIC.

Figure 2: Top left: wider projectors and resampling both yield more pairwise independence for the encoder output. Bottom left: increasing depth hurts HSIC for random or learned projectors. Top right: trade-off between independence of the representation and test accuracy (width D = 8192). Bottom right: ICA experiments. Resampling and increased width improve the quality of the reconstruction.

Figure 4: The projectors proposed in Section 3 better optimize HSIC than their classical counterpart.

Figure 6: The covariance matrices for some features of each activation in a random VICReg projector (same as in Bardes et al. (2022), with width 8192), computed before the corresponding hidden layer, are close to diagonal, suggesting that each hidden layer output is implicitly VC-regularized.

Figure 8: Distributions of $\mu_{\ell,k}$ (top row) and $\sigma_{\ell,k}$ (bottom row), the BN centering and scaling statistics computed at each mini-batch for three units in a DN layer with $W^{\ell} z^{\ell-1} \sim \mathcal{N}([1, 0, -1], \mathrm{diag}([1, 3, 0.1]))$ ($\ell$ is not relevant in this context). The empirical distributions in black are obtained by repeatedly sampling mini-batches of size 64 from a training set of size 1000. The analytical distributions in green are obtained by analysis, and their empirical counterparts in red closely match the training ones.

Figure 9: Nonlinear ICA model.

Figure 10: Data before mixing.

Figure 12: This level of max correlation is typically achieved by FastICA, VCReg or Anica.

Table 1: Measure of pairwise dependence (HSIC), pairwise independence testing, and test accuracy for popular SSL or supervised representations, averaged over multiple subsets of features.

Table 2: Maximum correlation between true and reconstructed sources (the higher the better ↑). The projector achieves competitive reconstruction in the linear setting (left), where pairwise independence is sufficient, but not in PNL (right), where it is not.

Table 3: Test accuracy of SSL representations on common downstream tasks (linear evaluation).

Measure of pairwise dependence (the lower the better) and test accuracy for the projectors studied in Section 3. L is the number of nonlinear projections in Lemma 1, D is the width of the projector and l is the number of layers.

Table 5: VICReg with a 2048-2048-2048 projector and a varying covariance coefficient. Increasing this coefficient does not necessarily decrease HSIC: the projector may start to optimize too much for the VC criterion, thus getting trapped in bad solutions.

VICReg with a learned MLP projector of varying sizes. The wider the projector, the lower the HSIC, which conforms to our theoretical results in Section 3.


In our study, the bandwidth σ of the Gaussian kernel is determined by the median of the distribution of pairwise Euclidean distances between samples.

D.2 DHSIC, EQ. (4)

dHSIC can be estimated given empirical samples $X_1, \dots, X_d$, with respective kernel matrices $K_1, \dots, K_d$, as

$\widehat{\mathrm{dHSIC}}(X_1, \dots, X_d) = \frac{1}{N^2} \sum_{i,j=1}^{N} \prod_{l=1}^{d} (K_l)_{i,j} + \prod_{l=1}^{d} \frac{1}{N^2} \sum_{i,j=1}^{N} (K_l)_{i,j} - \frac{2}{N} \sum_{i=1}^{N} \prod_{l=1}^{d} \frac{1}{N} \sum_{j=1}^{N} (K_l)_{i,j}.$

When d = 2, the first term corresponds to a biased HSIC estimator. The implementation of dHSIC is given below.
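The following is a self-contained sketch of this estimator (our own code, assuming the Gaussian kernel with the median heuristic described above; the released implementation may differ):

    import torch

    def gaussian_kernel_matrix(x, sigma=None):
        # Gaussian kernel with the median heuristic, as in the HSIC sketch of Section 2.1.
        d2 = torch.cdist(x, x).pow(2)
        if sigma is None:
            sigma = torch.sqrt(d2[d2 > 0]).median()
        return torch.exp(-d2 / (2 * sigma ** 2))

    def dhsic(*variables):
        # variables: d tensors of shape (N, M); biased dHSIC estimator (Pfister et al., 2018):
        # mean of the elementwise product of the kernels, plus the product of the kernel means,
        # minus twice the mean over samples of the product of row-wise kernel means.
        kernels = [gaussian_kernel_matrix(v) for v in variables]
        term1 = torch.stack(kernels).prod(dim=0).mean()
        term2 = torch.stack([k.mean() for k in kernels]).prod()
        term3 = torch.stack([k.mean(dim=1) for k in kernels]).prod(dim=0).mean()
        return term1 + term2 - 2 * term3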

