WHAT DO WE MAXIMIZE IN SELF-SUPERVISED LEARNING AND WHY DOES GENERALIZATION EMERGE?

Abstract

In this paper, we provide an information-theoretic (IT) understanding of self-supervised learning (SSL) methods, their construction, and their optimality. First, we demonstrate how IT quantities can be obtained for deterministic networks, as an alternative to the commonly used but unrealistic stochastic-network assumption. Second, we demonstrate how different SSL models can be (re)discovered from first principles and highlight the underlying assumptions of different SSL variants. Based on this understanding, we present new SSL methods that outperform existing methods. Third, we derive a novel generalization bound based on our IT understanding of SSL methods, providing generalization guarantees for the downstream supervised learning task. As a result of this bound, together with our unified view of SSL, we can compare the different approaches and provide general guidelines to practitioners. Consequently, our derivation and insights contribute to a better understanding of SSL and transfer learning from both a theoretical and a practical perspective.

1. INTRODUCTION

Self-supervised learning (SSL) methods learn representations using a surrogate objective between inputs and self-defined signals. In SimCLR (Chen et al., 2020), for example, a contrastive loss makes the representations of different augmented versions of the same image similar, while pushing the representations of different images apart. After optimizing the surrogate objective, the pre-trained model is used as a feature extractor for a downstream supervised task, such as image classification, object detection, instance segmentation, and transfer learning (Caron et al., 2021; Chen et al., 2020; Misra & Maaten, 2020; Shwartz-Ziv et al., 2022). However, despite this practical success, only a few authors (Arora et al., 2019; Lee et al., 2021a) have sought to provide theoretical insight into the effectiveness of SSL. In recent years, information-theoretic methods have played a key role in several deep learning achievements, from practical applications in representation learning (Alemi et al., 2016) to theoretical investigations (Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv, 2022). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities. Specifically, many works have attempted to analyze SSL from an information-theoretic perspective. An example is the use of the renowned information maximization (InfoMax) principle (Linsker, 1988) in SSL (Bachman et al., 2019). However, this body of work can be confusing: numerous objective functions are presented without rigorous justification, some contradicting each other, and many rest on implicit assumptions (Kahana & Hoshen, 2022; Wang et al., 2022; Lee et al., 2021b). Moreover, these works rely on a crucial assumption: a stochastic deep network (DN) mapping, which is rarely the case nowadays.
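For concreteness, the SimCLR-style contrastive objective mentioned above (the NT-Xent variant of InfoNCE) can be sketched as follows. This is a minimal numpy illustration under our own naming, not the reference implementation:

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """Minimal numpy sketch of the SimCLR NT-Xent (InfoNCE-style) loss.
    z1, z2: (N, K) embeddings of two augmented views; row i of z1 and
    row i of z2 come from the same image (a positive pair)."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2N, K)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit norm -> dot = cosine sim
    sim = (z @ z.T) / tau                             # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                    # a sample is not its own negative
    # the positive for each row of view 1 is the matching row of view 2, and vice versa
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    m = sim.max(axis=1, keepdims=True)                # numerically stable log-sum-exp
    lse = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    # cross-entropy of identifying the positive among the 2N - 1 candidates
    return float((lse - sim[np.arange(2 * n), pos]).mean())
```

Minimizing this loss pulls the two views of each image together while repelling all other samples in the batch, which is exactly the "similar versus different" behavior described above.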
This paper presents a unified framework for SSL methods from an information-theoretic perspective, which can be applied to deterministic DN training. We summarize our contributions in four points: (i) In order to study deterministic DNs from an information-theoretic perspective, we shift the stochasticity to the DN input, which is a far more faithful assumption for current training techniques. (ii) Based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities. (iii) We present new SSL methods based on our analysis and empirically validate their superior performance. (iv) We study how the optimization of information-theoretic quantities relates to final performance on the downstream task via a new generalization bound.
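Contribution (i) can be illustrated with a small sketch (our own naming and simplifications, not the paper's code): because a ReLU-type DN is continuous piecewise affine, placing the stochasticity at the input, x = x* + ε with ε ~ N(0, σ²I), makes the representation conditionally Gaussian within each affine region, Z | x* ≈ N(f(x*), σ² A_ω A_ωᵀ), so information-theoretic quantities become finite and well-defined even though the network itself is deterministic. The per-region slope A_ω can be extracted explicitly:

```python
import numpy as np

def relu_mlp(x, weights, biases):
    """Deterministic ReLU MLP: a continuous piecewise affine (CPA) mapping."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

def local_affine(x, weights, biases):
    """Slope A_w and offset b_w of the spline region containing x,
    so that f(z) = A_w @ z + b_w for every z inside that region."""
    A = np.eye(x.shape[0])
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        mask = (pre > 0).astype(float)  # active ReLU pattern in this region
        A = (W * mask[:, None]) @ A     # chain rule through the fixed pattern
        h = np.maximum(pre, 0.0)
    A = weights[-1] @ A
    offset = relu_mlp(x, weights, biases) - A @ x
    return A, offset

# With input noise x = x* + eps, eps ~ N(0, sigma^2 I), the representation is
# locally Gaussian: Z | x* ~ N(f(x*), sigma^2 A @ A.T), giving finite
# information-theoretic quantities for a deterministic network.
```

Within a region, `A` is simply the Jacobian of the network, so for smooth nonlinearities the same construction holds as a first-order Taylor approximation.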

2. BACKGROUND

Continuous Piecewise Affine (CPA) Mappings. A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition Ω of a domain R^D, a spline of order k is a mapping defined by a polynomial of order k on each region ω ∈ Ω, with continuity constraints on the entire domain for the derivatives of order 0, ..., k-1. As we will focus on affine splines (k = 1), we define only this case for concreteness. A K-dimensional affine spline f produces its output via

f(z) = Σ_{ω∈Ω} (A_ω z + b_ω) 1_{z∈ω},

with input z ∈ R^D, and A_ω ∈ R^{K×D}, b_ω ∈ R^K, ∀ω ∈ Ω the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain, f ∈ C^0(R^D). We will omit the Θ notation for clarity unless needed. The only assumption we require for our study is that the nonlinearities present in the DN are CPA, as is the case with (leaky-)ReLU, absolute value, and max-pooling. In that case, the entire input-output mapping becomes a CPA spline with an implicit partition Ω that is a function of the weights and architecture of the network (Montufar et al., 2014; Balestriero & Baraniuk, 2018). For smooth nonlinearities, our results hold from a first-order Taylor approximation argument.

Self-Supervised Learning. Joint embedding methods learn the DN parameters Θ without supervision and without input reconstruction. The difficulty of SSL is to produce a good representation for downstream tasks whose labels are not available during training, while avoiding a trivially simple solution in which the model maps all inputs to a constant output. Many methods have been proposed to solve this problem; see Balestriero & LeCun (2022) for a summary and for connections between methods. Contrastive methods learn representations by contrasting positive and negative examples, e.g., SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018). Other recent work introduced non-contrastive methods that employ different regularization schemes to prevent collapse of the representation. Several papers use stop-gradients and extra predictors to avoid collapse (Chen & He, 2021; Grill et al., 2020), while Caron et al. (2020) use an additional clustering step. As opposed to contrastive methods, non-contrastive methods do not explicitly rely on negative samples. Of particular interest to us is the VICReg method (Bardes et al., 2021), which considers two embedding batches Z = [f(x_1), ..., f(x_N)] and Z' = [f(x'_1), ..., f(x'_N)], each of size (N × K). Denoting by C the (K × K) covariance matrix obtained from [Z, Z'], we obtain the VICReg triplet loss

L = (1/K) Σ_{k=1}^K [ α max(0, γ − √(C_{k,k} + ε)) + β Σ_{k'≠k} (C_{k,k'})² ] + γ ∥Z − Z'∥²_F / N.

Deep Networks and Information Theory. Recently, information-theoretic methods have played a key role in several remarkable deep learning achievements (Alemi et al., 2016; Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv & Tishby, 2017). Moreover, different deep learning problems have been successfully approached by developing and applying information-theoretic estimators and learning principles (Hjelm et al., 2018; Belghazi et al., 2018; Piran et al., 2020; Shwartz-Ziv et al., 2018). There is, however, a major problem when it comes to analyzing information-theoretic objectives in deterministic deep neural networks: the source of randomness. The mutual information between the input and the representation in such networks is either infinite, resulting in an ill-posed optimization problem, or piecewise constant, making gradient-based optimization methods ineffective (Amjad & Geiger, 2019). To solve these problems, researchers have proposed several solutions. For SSL, stochastic deep networks with variational bounds could be used, where the output of the deterministic network is used as the parameters of the conditional distribution (Lee et al.,
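As a concrete reference point, the VICReg triplet loss can be sketched in a few lines of numpy. The hyperparameter names and defaults below are ours, not necessarily the original implementation's; note also that the formulation above reuses γ for both the variance threshold and the invariance weight, so the code names the latter `inv_weight`:

```python
import numpy as np

def vicreg_loss(Z, Zp, alpha=25.0, beta=1.0, gamma=1.0, inv_weight=25.0, eps=1e-4):
    """Minimal numpy sketch of the VICReg triplet loss.
    Z, Zp: (N, K) embedding batches for two augmented views.
    - variance term: hinge pushing each dimension's std above gamma,
    - covariance term: penalizes off-diagonal covariance (decorrelation),
    - invariance term: mean squared distance between the two views."""
    N, K = Z.shape
    C = np.cov(np.concatenate([Z, Zp], axis=0), rowvar=False)  # (K, K)
    std = np.sqrt(np.diag(C) + eps)
    var_term = alpha * np.maximum(0.0, gamma - std).mean()     # (1/K) sum over k
    off_diag = C - np.diag(np.diag(C))
    cov_term = beta * (off_diag ** 2).sum() / K
    inv_term = inv_weight * np.linalg.norm(Z - Zp) ** 2 / N    # ||Z - Z'||_F^2 / N
    return float(var_term + cov_term + inv_term)
```

The variance hinge is what rules out the collapsed solution: a constant embedding has zero per-dimension standard deviation, so it pays the full α·γ penalty per dimension even though its invariance and covariance terms vanish.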

