WHAT DO WE MAXIMIZE IN SELF-SUPERVISED LEARNING AND WHY DOES GENERALIZATION EMERGE?

Abstract

In this paper, we provide an information-theoretic (IT) understanding of self-supervised learning (SSL) methods, their construction, and their optimality. First, we demonstrate how IT quantities can be obtained for deterministic networks, as an alternative to the commonly used but unrealistic assumption of stochastic networks. Second, we show how different SSL models can be (re)discovered from first principles and highlight the underlying assumptions of different SSL variants. Building on this understanding, we present new SSL methods that outperform existing ones. Third, we derive a novel generalization bound based on our IT understanding of SSL methods, providing generalization guarantees for the downstream supervised learning task. This bound, together with our unified view of SSL, allows us to compare the different approaches and offer general guidelines to practitioners. Consequently, our derivation and insights contribute to a better understanding of SSL and transfer learning from both a theoretical and a practical perspective.

1. INTRODUCTION

Self-supervised learning (SSL) methods learn representations using a surrogate objective between inputs and self-defined signals. In SimCLR (Chen et al., 2020), for example, a contrastive loss makes the representations of different augmented versions of the same image similar, while pushing apart the representations of different images. After optimizing the surrogate objective, the pre-trained model is used as a feature extractor for a downstream supervised task, such as image classification, object detection, instance segmentation, or transfer learning (Caron et al., 2021; Chen et al., 2020; Misra & Maaten, 2020; Shwartz-Ziv et al., 2022). However, despite this success in practice, only a few authors (Arora et al., 2019; Lee et al., 2021a) have sought to provide theoretical insights into the effectiveness of SSL. In recent years, information-theoretic methods have played a key role in several deep learning achievements, from practical applications in representation learning (Alemi et al., 2016) to theoretical investigations (Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv, 2022). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities. In particular, many works have attempted to analyze SSL from an information-theoretic perspective; one example is the use of the renowned information maximization (InfoMax) principle (Linsker, 1988) in SSL (Bachman et al., 2019). However, the picture that emerges from these works can be confusing: numerous objective functions are presented without rigorous justification, some contradict each other, and many rest on implicit assumptions (Kahana & Hoshen, 2022; Wang et al., 2022; Lee et al., 2021b). Moreover, these works rely on a crucial assumption: a stochastic deep network (DN) mapping, which is rarely the case nowadays.
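To make the SimCLR objective concrete, the following is a minimal NumPy sketch of its contrastive (NT-Xent) loss, written from the published description rather than the official implementation: each row of `z1` and the corresponding row of `z2` are embeddings of two augmentations of the same image (a positive pair), and all other rows in the batch serve as negatives. The function name and the temperature default are illustrative choices, not part of the paper.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss, as in SimCLR.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Row i of z1 and row i of z2 form a positive pair; every other row in the
    concatenated batch is treated as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, D) joint batch
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = (z @ z.T) / temperature                      # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # index of each sample's positive partner: i <-> i + n
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # softmax cross-entropy of each positive against all other pairs
    logits = sim - sim.max(axis=1, keepdims=True)      # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

When the two views of each image map to identical embeddings, the loss is small; if positives are misaligned with their partners, the loss grows, which is exactly the "similar views together, different images apart" behavior described above.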
This paper presents a unified framework for SSL methods from an information-theoretic perspective, which can be applied to deterministic DN training. We summarize our contributions in four points: (i) First, to study deterministic DNs from an information-theoretic perspective, we shift the stochasticity to the DN input, a much more faithful assumption for current training techniques. (ii) Second, based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities. (iii) Third, we present new SSL

