VARIANCE COVARIANCE REGULARIZATION ENFORCES PAIRWISE INDEPENDENCE IN SELF-SUPERVISED REPRESENTATIONS

Abstract

Self-Supervised Learning (SSL) methods such as VICReg, Barlow Twins, or W-MSE avoid collapse of their joint embedding architectures by constraining or regularizing the covariance matrix of their projector's output. This study highlights important properties of such a strategy, which we coin Variance-Covariance regularization (VCReg). More precisely, we show that VCReg combined with an MLP projector enforces pairwise independence between the features of the learned representation. This result emerges from bridging VCReg applied to the projector's output with kernel independence criteria applied to the projector's input, and provides the first theoretical motivation and explanation of MLP projectors in SSL. We empirically validate our findings: (i) we identify which projector characteristics favor pairwise independence, (ii) we use these findings to obtain nontrivial performance gains for VICReg, and (iii) we demonstrate that the scope of VCReg goes beyond SSL by using it to solve Independent Component Analysis. We hope that our findings will support the adoption of VCReg in SSL and beyond.

1. INTRODUCTION

Self-Supervised Learning (SSL) via joint embedding architectures has emerged as a way to learn visual representations that outperform their supervised counterparts. This paradigm enforces similar embeddings for two augmented versions of the same sample, thus allowing an encoder f to learn a representation for a given modality without labels. Importantly, f could solve the learning task by predicting the same embedding for every input, a failure mode known as collapse. To avoid this, various mechanisms have been proposed, hence the diversity of SSL methods (e.g., Grill et al. (2020); Caron et al. (2020; 2021); Chen & He (2021)). Most of these methods train a composition g ∘ f of the encoder with a projector neural network g. Only f is retained after training, in contrast with supervised training, which never introduces g. The projector was proposed by Chen et al. (2020) and significantly improves the quality of the learned representation in terms of test accuracy on ImageNet and other downstream tasks. Although some works (Appalaraju et al., 2020; Bordes et al., 2022) provide empirical knowledge about the projector, none provide a theoretical analysis of MLP projectors in practical SSL (Jing et al., 2021; Huang et al., 2021; Wang & Isola, 2020; HaoChen et al., 2021; Tian et al., 2020; Wang & Liu, 2021; Cosentino et al., 2022). This study sheds new light on the role of the projector through the lens of Variance-Covariance Regularization (VCReg), a strategy introduced in recent SSL methods (Bardes et al., 2022; Zbontar et al., 2021; Ermolov et al., 2021) to cope with collapse by constraining or regularizing the covariance or cross-correlation matrix of the projector g's output to be the identity. More precisely, we demonstrate that VC regularization of the projector's output enforces pairwise independence between the components of the projector's input, i.e., the encoder's output, and we connect this property to projector characteristics such as width and depth.
This provides the first theoretical motivation and explanation of MLP projectors in SSL. Fully or partially pairwise independent representations are generally sought after, e.g., to disentangle factors of variation (Li et al., 2019; Träuble et al., 2021). Our experimental analysis suggests that different levels of pairwise independence of the features in the representation emerge from a variety of SSL criteria, along with mutual independence. However, as opposed to other frameworks, VCReg allows for theoretical study and explicit control of the amount of independence learned. We prove and experimentally validate this property for random projectors, study how it emerges in learned projectors, and use our results to obtain new and significant performance gains for VICReg over Bardes et al. (2022). We then ablate the SSL context and lean on our findings to show that VCReg of an SSL projector solves Independent Component Analysis (ICA). Beyond providing a novel theoretical understanding of the projector, we believe that this work also leads to a better understanding of VICReg. The scope of VCReg is not limited to SSL: our experiments on ICA open the way to other applications where some degree of independence is needed.
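To make the regularization concrete before the formal treatment, the following is a minimal NumPy sketch of a variance-covariance penalty on a batch of projector outputs, in the spirit of VICReg (Bardes et al., 2022): a hinge keeps each feature's standard deviation above a threshold of 1 (preventing collapse), and the squared off-diagonal covariances are penalized (decorrelating features). The function name, epsilon, and normalization are illustrative choices, not the authors' implementation.

```python
import numpy as np

def vcreg_penalty(Z, eps=1e-4):
    """Variance-covariance penalty on a batch of embeddings Z of shape (N, d).

    Variance term: hinge pushing each feature's std above 1 (anti-collapse).
    Covariance term: squared off-diagonal covariances (feature decorrelation).
    """
    N, d = Z.shape
    Zc = Z - Z.mean(axis=0)
    std = np.sqrt(Zc.var(axis=0) + eps)
    var_term = np.mean(np.maximum(0.0, 1.0 - std))
    cov = (Zc.T @ Zc) / (N - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = np.sum(off_diag ** 2) / d
    return var_term, cov_term
```

On a batch of i.i.d. standard normal embeddings both terms are near zero, while a collapsed batch (identical rows) saturates the variance hinge, which is exactly the failure mode the penalty guards against.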

2. BACKGROUND

2.1 MEASURING STATISTICAL INDEPENDENCE USING KERNEL METHODS

Measuring the independence between two sets of realizations $\{X_1^1, \dots, X_1^N\}$ and $\{X_2^1, \dots, X_2^N\}$, with $X_1^i, X_2^i \in \mathbb{R}^M$, is a fundamental task with a long history in statistics, e.g., through the Mutual Information (MI) of the two random variables $X_1$ and $X_2$ from which those two sets are independently drawn (Cover, 1999). Those variables are said to be independent if the realization of one does not affect the probability distribution of the other. Computing the MI in practice is known to be challenging (Goebel et al., 2005), which has led to considerable interest in alternative criteria, e.g., based on functions in Reproducing Kernel Hilbert Spaces (RKHS) (Bach & Jordan, 2002), a special case of what is known as functional covariance or correlation (Rényi, 1959). It consists in computing those statistics after a nonlinear transformation (Leurgans et al., 1993), as in
$$\sup_{f_1 \in \mathcal{F}_1,\, f_2 \in \mathcal{F}_2} \operatorname{Corr}\big(f_1(X_1), f_2(X_2)\big), \qquad (1)$$
where $f_1, f_2$ are constrained to lie within some designed functional spaces, and Cov can be used instead of Corr. If Eq. (1) is small enough, then $X_1$ and $X_2$ are independent with respect to the functional spaces $\mathcal{F}_1, \mathcal{F}_2$. For example, if $\mathcal{F}_1$ and $\mathcal{F}_2$ are unit balls in their respective vector spaces, then Eq. (1) is just the norm of the usual correlation/covariance operator (Mourier, 1953), which would suffice for independence under jointly Gaussian distributions (Melnick & Tenenbein, 1982). HSIC and pairwise independence. More recently, Gretton et al.
(2005a) introduced a pairwise independence criterion known as the Hilbert-Schmidt Independence Criterion (HSIC), which can be estimated given empirical samples $X_1, X_2 \in \mathbb{R}^{N \times M}$ as
$$\widehat{\mathrm{HSIC}}(X_1, X_2) = \frac{1}{(N-1)^2} \operatorname{Tr}(K_1 H K_2 H),$$
with $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$ the centering matrix, $(K_1)_{i,j} = k_1(X_1^i, X_1^j)$ and $(K_2)_{i,j} = k_2(X_2^i, X_2^j)$ the two kernel matrices of $X_1$ and $X_2$ respectively, and $k_1, k_2$ the universal kernels associated with $\mathcal{F}_1, \mathcal{F}_2$, such as the Gaussian kernel (see Steinwart (2001); Micchelli et al. (2006) for other examples). Crucially, since $\mathrm{HSIC}(X_1, X_2) \geq \sup_{f_1 \in \mathcal{F}_1,\, f_2 \in \mathcal{F}_2} \operatorname{Cov}\big(f_1(X_1), f_2(X_2)\big)$, HSIC can be used to test for (pairwise) independence, as formalized below.

Theorem 1 (Thm. 4 from Gretton et al. (2005a)). $\mathrm{HSIC}(X_1, X_2) = 0$ if and only if $X_1$ and $X_2$ are independent.

Gretton et al. (2005a) also provide a statistical test for pairwise independence based on HSIC. Further quantities such as upper bounds on the MI can be derived in a similar way, e.g., see Thm. 16 in Gretton et al. (2005b). In our experiments, we rely on HSIC under the Gaussian kernel scaled by the median of the distribution of pairwise Euclidean distances between samples.

dHSIC and mutual independence. Mutual independence of a set of $D$ $M$-dimensional random variables $X_1, \dots, X_D$ is a stronger property than independence between all pairs of random variables in the set. To evaluate it, Pfister et al. (2018) introduce dHSIC, a multivariate extension of HSIC. In short, dHSIC measures the distance between the mean embeddings $\mu$ under an RKHS $\mathcal{F}$ (Smola et al., 2007) of the product of the marginal distributions and of the joint distribution:
$$\mathrm{dHSIC}(X_1, \dots, X_D) := \big\| \mu(P_{X_1} \otimes \cdots \otimes P_{X_D}) - \mu(P_{X_1, \dots, X_D}) \big\|_{\mathcal{F}}. \qquad (4)$$
Similarly to HSIC, Pfister et al. (2018) establish the equivalence between dHSIC = 0 and mutual independence, along with a statistical test. We provide an estimator of dHSIC given empirical samples $X_1, \dots, X_D$, each in $\mathbb{R}^{N \times M}$, as well as implementations of HSIC and dHSIC in Appendix D.
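The paper's own implementations are deferred to its Appendix D; as a stand-in, here is a minimal NumPy sketch of the biased empirical estimators discussed above, using the Gaussian kernel with the median heuristic as in the experiments. Function names and the bandwidth fallback are illustrative assumptions, not the authors' code. The closing example uses the classic case of $X$ and $X^2$: linearly uncorrelated, yet fully dependent, which HSIC detects where plain covariance does not.

```python
import numpy as np

def gaussian_kernel(X, sigma=None):
    """Gaussian kernel matrix of the rows of X (shape (N, M)); median-heuristic bandwidth."""
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    if sigma is None:
        # median of pairwise Euclidean distances (off-diagonal entries only)
        sigma = np.median(np.sqrt(d2[np.triu_indices(len(X), k=1)]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X1, X2):
    """Biased empirical HSIC: Tr(K1 H K2 H) / (N - 1)^2."""
    N = len(X1)
    H = np.eye(N) - 1.0 / N  # centering matrix I - (1/N) 1 1^T
    K1, K2 = gaussian_kernel(X1), gaussian_kernel(X2)
    return np.trace(K1 @ H @ K2 @ H) / (N - 1) ** 2

def dhsic(*Xs):
    """Biased empirical dHSIC of D samples (Pfister et al., 2018);
    D = 2 recovers the HSIC estimator up to the (N-1)^2 vs N^2 normalization."""
    Ks = [gaussian_kernel(X) for X in Xs]
    term1 = np.mean(np.prod(Ks, axis=0))                            # mean_ij prod_d K_d[i,j]
    term2 = np.prod([K.mean() for K in Ks])                         # prod_d mean_ij K_d[i,j]
    term3 = np.mean(np.prod([K.mean(axis=1) for K in Ks], axis=0))  # mean_i prod_d mean_j K_d[i,j]
    return term1 + term2 - 2.0 * term3

# X and X**2 are linearly uncorrelated yet dependent: HSIC is large, not near zero.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
print(hsic(x, x ** 2), hsic(x, rng.normal(size=(300, 1))))
```

For D = 2 the dHSIC estimator reduces algebraically to Tr(K1 H K2 H)/N², which provides a built-in consistency check between the two functions.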

