VARIANCE-COVARIANCE REGULARIZATION ENFORCES PAIRWISE INDEPENDENCE IN SELF-SUPERVISED REPRESENTATIONS

Abstract

Self-Supervised Learning (SSL) methods such as VICReg, Barlow Twins, or W-MSE avoid collapse of their joint embedding architectures by constraining or regularizing the covariance matrix of their projector's output. This study highlights important properties of such a strategy, which we coin Variance-Covariance regularization (VCReg). More precisely, we show that VCReg combined with an MLP projector enforces pairwise independence between the features of the learned representation. This result emerges by bridging VCReg, applied to the projector's output, to kernel independence criteria applied to the projector's input. This provides the first theoretical motivation and explanation of MLP projectors in SSL. We validate our findings empirically: (i) we identify which projector characteristics favor pairwise independence, (ii) we use these findings to obtain nontrivial performance gains for VICReg, and (iii) we demonstrate that the scope of VCReg goes beyond SSL by using it to solve Independent Component Analysis. We hope that our findings will support the adoption of VCReg in SSL and beyond.

1. INTRODUCTION

Self-Supervised Learning (SSL) via joint embedding architectures has emerged as a way to learn visual representations that outperform their supervised counterparts. This paradigm enforces similar embeddings for two augmented versions of the same sample, allowing an encoder f to learn a representation for a given modality without labels. Importantly, f could solve the learning task by predicting the same embedding for every input, a failure mode known as collapse. To avoid this, various mechanisms have been proposed, hence the diversity of SSL methods (e.g., Grill et al. (2020); Caron et al. (2020; 2021); Chen & He (2021)). Most of these methods train a composition g • f of the encoder with a projector neural network g. Only f is retained after training, in contrast to supervised training, which never introduces g. The projector was proposed by Chen et al. (2020) and significantly improved the quality of the learned representation in terms of test accuracy on ImageNet and other downstream tasks. Although some works (Appalaraju et al., 2020; Bordes et al., 2022) provide empirical knowledge about the projector, none provide a theoretical analysis of MLP projectors in practical SSL (Jing et al., 2021; Huang et al., 2021; Wang & Isola, 2020; HaoChen et al., 2021; Tian et al., 2020; Wang & Liu, 2021; Cosentino et al., 2022). This study sheds new light on the role of the projector through the lens of Variance-Covariance Regularization (VCReg), a strategy introduced in recent SSL methods (Bardes et al., 2022; Zbontar et al., 2021; Ermolov et al., 2021) to cope with collapse by constraining or regularizing the covariance or cross-correlation matrix of the projector g's output to be the identity. More precisely, we demonstrate that VC regularization of the projector's output enforces pairwise independence between the components of the projector's input, i.e., the encoder's output, and we connect this property to projector characteristics such as width and depth.
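To make the regularization concrete, the following is a minimal NumPy sketch of the variance and covariance terms described above: a hinge loss pushing each output dimension's standard deviation toward a target, plus a penalty on the squared off-diagonal entries of the batch covariance matrix. Function name, hyperparameters (gamma, eps), and term weighting are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def vcreg_loss(z, gamma=1.0, eps=1e-4):
    """Sketch of Variance-Covariance regularization on a batch of
    projector outputs z of shape (n, d). Illustrative, not the
    reference implementation."""
    n, d = z.shape
    z = z - z.mean(axis=0)  # center each dimension over the batch
    # Variance term: hinge keeping each dimension's std above gamma,
    # which prevents dimensions from collapsing to a constant.
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    # Covariance term: penalize squared off-diagonal covariances,
    # decorrelating the output dimensions from one another.
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = np.sum(off_diag ** 2) / d
    return var_loss + cov_loss
```

A batch of mutually independent features incurs a near-zero penalty, while redundant (correlated) features are heavily penalized through the covariance term, which is what drives the decorrelation discussed above.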
This provides the first theoretical motivation and explanation of MLP projectors in SSL. Fully or partially pairwise independent representations are generally sought after, e.g., to disentangle factors of variation (Li et al., 2019; Träuble et al., 2021). Our experimental analysis suggests that different levels of pairwise independence of the features in the representation emerge from a variety of SSL criteria, along with mutual independence. However, as opposed to other frameworks, VCReg allows for a theoretical study and explicit control of the amount of learned independence. We prove and experimentally validate this property for random projectors, study how it

