ON INFORMATION MAXIMISATION IN MULTI-VIEW SELF-SUPERVISED LEARNING

Abstract

The strong performance of multi-view self-supervised learning (SSL) has prompted the development of many different approaches (e.g., SimCLR, BYOL, and DINO). A unified understanding of how each of these methods achieves its performance has been limited by apparent differences across objectives and algorithmic details. Through the lens of information theory, we show that many of these approaches maximise an approximate lower bound on the mutual information between the representations of multiple views of the same datum. Further, we observe that this bound decomposes into a "reconstruction" term, treated identically by all SSL methods, and an "entropy" term, where existing SSL methods differ in their treatment. We prove that an exact optimisation of both terms of this lower bound encompasses and unifies current theoretical properties, such as recovering the true latent variables of the underlying generative process (Zimmermann et al., 2021) or isolating content from style in such true latent variables (Von Kügelgen et al., 2021). This theoretical analysis motivates a naive but principled objective (EntRec) that directly optimises both the reconstruction and entropy terms, thus benefiting from said theoretical properties, unlike other SSL frameworks. Finally, we show that EntRec achieves downstream performance on par with existing SSL methods on ImageNet (69.7% after 400 epochs) and on an array of transfer tasks when pre-trained on ImageNet. Furthermore, EntRec is more robust to modifying the batch size, a sensitive hyperparameter in other SSL methods.

1. INTRODUCTION

Representation learning commonly tackles the problem of learning compressed representations of data that capture their semantic information. A necessary, but not sufficient, property of a good representation is thus that it is highly informative of said data. For this reason, many representation learning methods aim to maximise the mutual information between the input data and the representations, while including inductive biases in the model that steer that information towards being semantic (Agakov, 2004; Alemi et al., 2017; Hjelm et al., 2018; Oord et al., 2018; Velickovic et al., 2019). Moreover, mutual information has been the central object used to understand the performance of many of these algorithms (Saxe et al., 2019; Rodríguez Gálvez et al., 2020; Goldfeld & Polyanskiy, 2020). A subfield of representation learning is self-supervised learning (SSL), which comprises algorithms that learn representations by solving an artificial task with self-generated labels. A particularly successful approach to SSL is multi-view SSL, where different views of the input data are generated and the self-generated task is to ensure that the representations of one view are predictive of the representations of the other views, cf. (Jing & Tian, 2020; Liu et al., 2022). Multi-view SSL algorithms based on the InfoNCE (Oord et al., 2018), such as (Bachman et al., 2019; Federici et al., 2020; Tian et al., 2020a), focus on maximising the mutual information between the representations and the input data by maximising the mutual information between the representations of different views (Poole et al., 2019). Similarly, Shwartz-Ziv et al. showed that (Bardes et al., 2022, VICReg) also maximises this information, even though it was not designed for this purpose. Moreover, Tian et al. (2020b); Tsai et al. (2020) provide perspectives on why maximising this mutual information is attractive and discuss some of its properties. However, Tschannen et al. (2019); McAllester & Stratos (2020) warn about the caveats of this maximisation (e.g., that it is not sufficient for good representations). Here, we complement these efforts on multiple fronts and contribute:

• Showing that maximising the lower bound (1) on the mutual information between representations of different views has desirable properties for good representations (Section 2). More precisely, we show that this maximisation unifies current theories on learning the true explanatory factors of the input (Zimmermann et al., 2021) and separating semantic from irrelevant information (Von Kügelgen et al., 2021).

• Showing how many existing multi-view SSL algorithms also maximise this mutual information, although without exactly maximising the lower bound (1). This completes the picture of contrastive methods with an analysis of (Chen et al., 2020a, SimCLR), where such a result was only known for (Tian et al., 2020a, CMC)-like methods (Poole et al., 2019; Wu et al., 2020) under the InfoNCE (Oord et al., 2018) assumptions requiring i.i.d. negative samples. It also provides a unifying framework with other projections' reconstruction methods such as (Chen & He, 2021, SimSiam), (Grill et al., 2020, BYOL), (Caron et al., 2018; 2020, DeepCluster and SwAV), and (Caron et al., 2021, DINO).

• Demonstrating how a proposed naive method that directly maximises the aforementioned bound (1) on this mutual information (EntRec) has comparable performance to current state-of-the-art methods and is more robust to changes in training hyperparameters such as the batch size (Section 4 and Section 5).

This paper is a recognition of the importance of maximising the mutual information between the representations of different views of the input data: doing so by maximising (1) has desirable properties (Section 2), and many methods that maximise it (Section 3), including naive ones (Section 4), have good empirical performance (Section 5). However, since maximising mutual information is not sufficient for good representations (Tschannen et al., 2019), this paper is also a call to include more biases in the model and the optimisation that enforce the representations to learn semantic information. Appendix A completes the positioning of the paper with respect to related work.

Notation. Upper-case letters X represent random objects, lower-case letters x their realisations, calligraphic letters 𝒳 their outcome space, and P_X their distribution. Random objects X are assumed to have a density p_X with respect to some measure µ,foot_0 and the expectation of a function f of X is written as E[f(X)] := E_{x∼p_X}[f(x)]. When two random objects X, Y are considered, the conditional density of X given Y is written as p_{X|Y}, and for each realisation y of Y it describes the density p_{X|Y=y}. Sometimes, the notation is abused to write a "variational" density q_{X|Y} of X given Y; formally, this amounts to considering a different random object X′ such that p_{X′|Y} = q_{X|Y}. The mutual information between two random objects X and Y is written as I(X; Y), and their conditional mutual information given the random object Z as I(X; Y | Z). The Shannon entropy and the differential entropy of a random object X are both written as H(X); which one is meant is clear from the context. The Jensen–Shannon divergence between two distributions P and Q is written as D_JS(P∥Q). A set of k elements x^(1), …, x^(k) is denoted as x^(1:k), a (possibly unordered) subsequence x^(a), …, x^(b) of those elements is denoted as x^(a:b), and all the elements in x^(1:k) except x^(i) are denoted as x^(-i).

foot_0: Here, this measure will either be the Lebesgue measure, in which case p_X denotes the standard probability density function (pdf), or the counting measure, in which case p_X denotes the standard probability mass function (pmf).

2. MULTI-VIEW SSL AND MUTUAL INFORMATION

In multi-view SSL, two (or more) views (potentially generated using augmentations) of the same data sample X are generated (Bachman et al., 2019; Tian et al., 2020a;b; Chen et al., 2020a; Caron et al., 2020; Zbontar et al., 2021). The views V_1, V_2 are engineered such that most of the semantic information S of the data is preserved (Tian et al., 2020b). This process generates two branches where the views are processed to produce representations R_1, R_2, which are later projected into a lower-dimensional space as Z_1, Z_2. Finally, the model's parameters θ are optimised so that the projected representations (projections) from one branch, say Z_1, are predictive of the projections of the other branch, Z_2 (see Figure 1). In particular, as shown in Section 3, many multi-view SSL methods aim to maximise the mutual information between the projections, I(Z_1; Z_2). Consider the following decomposition of the mutual information (Agakov, 2004; Rodríguez Gálvez et al., 2020):

I(Z_1; Z_2) = H(Z_2) − H(Z_2|Z_1) ≥ H(Z_2) + E[log q_{Z_2|Z_1}(Z_2)],    (1)

where the first term on the right-hand side is the "entropy" term and the second is the "reconstruction" term.
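As a sanity check on the decomposition and lower bound (1), the following sketch verifies both numerically for a small discrete joint distribution of (Z_1, Z_2) (the 2×2 probabilities are made up for illustration): the entropy-plus-reconstruction term never exceeds I(Z_1; Z_2), and it is tight exactly when the variational density q_{Z_2|Z_1} equals the true conditional p_{Z_2|Z_1}.

```python
import numpy as np

# Hypothetical joint pmf of (Z1, Z2); rows index z1, columns index z2.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

p_z1 = p.sum(axis=1)
p_z2 = p.sum(axis=0)

def entropy(q):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    q = q[q > 0]
    return -np.sum(q * np.log(q))

H_z2 = entropy(p_z2)
# H(Z2|Z1) = sum_z1 p(z1) H(Z2 | Z1 = z1)
p_z2_given_z1 = p / p_z1[:, None]
H_z2_given_z1 = np.sum(p_z1 * np.array([entropy(row) for row in p_z2_given_z1]))
I = H_z2 - H_z2_given_z1                    # I(Z1; Z2) = H(Z2) - H(Z2|Z1)

# Reconstruction term with an arbitrary variational conditional q_{Z2|Z1}.
q_var = np.array([[0.7, 0.3],
                  [0.3, 0.7]])
recon = np.sum(p * np.log(q_var))           # E[log q_{Z2|Z1}(Z2)]
bound = H_z2 + recon                        # right-hand side of (1)

assert bound <= I + 1e-12                   # Gibbs' inequality: always a lower bound
exact = H_z2 + np.sum(p * np.log(p_z2_given_z1))
assert abs(exact - I) < 1e-12               # tight when q equals the true conditional
```

The gap between the bound and I(Z_1; Z_2) is E[D_KL(p_{Z_2|Z_1} ∥ q_{Z_2|Z_1})], which is why optimising the reconstruction term over q tightens the bound.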


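The two terms in (1) play complementary roles: the reconstruction term alone is maximised by a collapsed encoder (a nearly constant Z_2 is trivially predictable from Z_1), while the entropy term penalises exactly that degeneracy. A minimal numeric illustration of this trade-off, again with made-up discrete distributions rather than any method's actual projections:

```python
import numpy as np

def entropy(q):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    q = q[q > 0]
    return -np.sum(q * np.log(q))

def lower_bound(joint, q_cond):
    """H(Z2) + E[log q_{Z2|Z1}(Z2)], the right-hand side of (1)."""
    p_z2 = joint.sum(axis=0)
    return entropy(p_z2) + np.sum(joint * np.log(q_cond))

# Informative projections: Z2 depends on Z1, q matches the true conditional.
joint_good = np.array([[0.45, 0.05],
                       [0.05, 0.45]])
q_good = joint_good / joint_good.sum(axis=1, keepdims=True)

# Collapsed projections: Z2 is almost constant, so reconstruction is nearly
# perfect, but the entropy term (and hence the bound) is close to zero.
joint_bad = np.array([[0.49, 0.01],
                      [0.49, 0.01]])
q_bad = joint_bad / joint_bad.sum(axis=1, keepdims=True)

assert lower_bound(joint_good, q_good) > lower_bound(joint_bad, q_bad)
```

This is consistent with the observation above that existing SSL methods treat the reconstruction term identically and differ precisely in how they handle (or implicitly enforce) the entropy term.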