SELF-SUPERVISED LEARNING FROM A MULTI-VIEW PERSPECTIVE

Abstract

As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning. Many proposed approaches for self-supervised learning naturally follow a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building from this multi-view perspective, this paper provides an information-theoretical framework to better understand the properties that encourage successful self-supervised learning. Specifically, we demonstrate that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework paves the way to a larger space of self-supervised learning objective designs. In particular, we propose a composite objective that bridges the gap between prior contrastive and predictive learning objectives, and introduce an additional objective term to discard task-irrelevant information. To verify our analysis, we conduct controlled experiments to evaluate the impact of the composite objectives. We also explore our framework's empirical generalization beyond the multi-view perspective, where the cross-view redundancy may not be clearly observed.

1. INTRODUCTION

Self-supervised learning (SSL) (Zhang et al., 2016; Devlin et al., 2018; Oord et al., 2018; Tian et al., 2019) learns representations using a proxy objective (i.e., an SSL objective) between inputs and self-defined signals. Empirical evidence suggests that the learned representations can generalize well to a wide range of downstream tasks, even though the SSL objective does not utilize any downstream supervision during training. For example, SimCLR (Chen et al., 2020) defines a contrastive loss (i.e., an SSL objective) between images with different augmentations (i.e., one as the input and the other as the self-supervised signal). One can then use SimCLR as a feature extractor and apply the features to various computer vision applications, spanning image classification, object detection, instance segmentation, and pose estimation (He et al., 2019). Despite this success in practice, only a few works (Arora et al., 2019; Lee et al., 2020; Tosh et al., 2020) provide theoretical insights into the learning efficacy of SSL. Our work shares a similar goal of explaining the success of SSL, from the perspectives of Information Theory (Cover & Thomas, 2012) and multi-view representation learning[1]. To understand (a subset[2] of) SSL, we start with the following multi-view assumption. First, we regard the input and the self-supervised signal as two corresponding views of the data. Using our running example, in SimCLR (Chen et al., 2020), the input and the self-supervised signal are two augmented views of the same image. Second, we adopt a common assumption in multi-view learning: either view alone is (approximately) sufficient for the downstream tasks (see Assumption 1 in prior work (Sridharan & Kakade, 2008)). The assumption suggests that image augmentations (e.g., changing the style of an image) should not affect the labels of images, or analogously, that the self-supervised signal contains most (if not all) of the information that the input has about the downstream tasks.
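To make the running example concrete, here is a minimal NumPy sketch of a SimCLR-style contrastive (InfoNCE) loss between two batches of embeddings of corresponding augmented views. The function name, batch shapes, and temperature value are illustrative, not the exact SimCLR implementation.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.5):
    """Contrastive (InfoNCE) loss between two batches of embeddings
    z1, z2 of shape (n, d): matching rows are positive pairs, and
    all other rows in the batch serve as negatives."""
    # L2-normalize so the dot product is a cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy on the diagonal: each view must identify its partner.
    return -np.mean(np.diag(log_prob))
```

Aligned pairs (each row of `z1` equal to the matching row of `z2`) yield a low loss, while mismatched batches yield a loss near log n, reflecting that the loss is a lower-bound estimator of the mutual information between the two views.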
With this assumption, our first contribution is to formally show that self-supervised learned representations can 1) extract all the task-relevant information (from the input) with a potential loss; and 2) discard all the task-irrelevant information (from the input) with a fixed gap. Then, using a classification task as an example, we are able to quantify the smallest achievable generalization error (the Bayes error rate) given the discussed task-relevant and task-irrelevant information. As the second contribution, our analysis 1) connects prior contrastive (Oord et al., 2018; Bachman et al., 2019; Chen et al., 2020; Tian et al., 2019) and predictive (Zhang et al., 2016; Vondrick et al., 2016; Tulyakov et al., 2018; Devlin et al., 2018) SSL approaches; and 2) paves the way to a larger space of composite SSL objectives that extract task-relevant and discard task-irrelevant information simultaneously. For instance, combining the contrastive and predictive objectives achieves better performance than either objective alone and suffers less from over-fitting. We also present a new objective to discard task-irrelevant information, which can be easily incorporated into prior self-supervised learning objectives. We conduct controlled experiments on visual (the first set) and visual-textual (the second set) self-supervised representation learning. The first set of experiments is performed when the multi-view assumption is likely to hold; the goal is to compare different compositions of SSL objectives on extracting task-relevant and discarding task-irrelevant information. The second set of experiments is performed when the input and the self-supervised signal lie in very different modalities. Under this cross-modality setting, the task-relevant information may not mostly lie in the information shared between the input and the self-supervised signal; the goal is to examine how SSL objectives generalize when the multi-view assumption is likely to fail.
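The composite objective discussed above can be sketched as follows: a contrastive term (a lower bound on the mutual information between the representation and the signal) plus a predictive term, here written as a mean-squared error as a Gaussian-likelihood proxy for predicting the signal's embedding. The function name, the trade-off weight `lam`, and the MSE choice are our illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def composite_ssl_loss(z_x, z_s, lam=1.0, temperature=0.5):
    """Hypothetical composite SSL objective: InfoNCE-style contrastive
    term + predictive (regression) term, weighted by lam."""
    z_x = z_x / np.linalg.norm(z_x, axis=1, keepdims=True)
    z_s = z_s / np.linalg.norm(z_s, axis=1, keepdims=True)
    # Contrastive term: identify the matching signal among the batch.
    logits = z_x @ z_s.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_prob))
    # Predictive term: regress the signal embedding from the input
    # embedding (Gaussian log-likelihood up to constants).
    predictive = np.mean((z_x - z_s) ** 2)
    return contrastive + lam * predictive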

2. A MULTI-VIEW INFORMATION-THEORETICAL FRAMEWORK

Notations. For the input, we denote its random variable as X, sample space as X, and outcome as x. We learn a representation (random variable/sample space/outcome: Z_X / Z_X / z_x) from the input through a deterministic mapping F_X: Z_X = F_X(X). For the self-supervised signal, we denote its random variable/sample space/outcome as S / S / s. The two sample spaces may differ between the input and the self-supervised signal: X ≠ S. The information required for downstream tasks is referred to as "task-relevant information": T / T / t. Note that SSL has no access to the task-relevant information. Lastly, for random variables A, B, and C, we use I(A; B) to represent mutual information, I(A; B|C) conditional mutual information, H(A) entropy, and H(A|B) conditional entropy. We provide high-level takeaways for our main results in Figure 1. We defer all proofs to the Supplementary.
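As a concrete illustration of these information-theoretic quantities, the following NumPy sketch computes entropy, mutual information, and conditional mutual information for discrete variables from their joint probability tables. The function names are ours, for illustration only.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability array; zero entries
    are skipped since p * log(p) -> 0 as p -> 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(p_xy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) from a joint table p_xy."""
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) \
        - entropy(p_xy.ravel())

def conditional_mi(p_xyz):
    """I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z) from a joint
    table p_xyz with axes ordered (X, Y, Z)."""
    return (entropy(p_xyz.sum(axis=1).ravel())      # H(X, Z)
            + entropy(p_xyz.sum(axis=0).ravel())    # H(Y, Z)
            - entropy(p_xyz.sum(axis=(0, 1)))       # H(Z)
            - entropy(p_xyz.ravel()))               # H(X, Y, Z)
```

For example, two perfectly correlated fair binary variables have I(X; Y) = log 2, independent variables have I(X; Y) = 0, and if Z determines both X and Y then I(X; Y | Z) = 0, mirroring the redundancy structure the multi-view assumption relies on.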

2.1. MULTI-VIEW ASSUMPTION

In our paper, we regard the input (X) and the self-supervised signals (S) as two views of the data. 



[1] The works (Lee et al., 2020; Tosh et al., 2020) were done concurrently and in parallel, and some of their assumptions and conclusions are similar to ours. We elaborate on the differences in the related work section.
[2] We discuss the limitations of the multi-view assumption in Section 2.1.



Figure 1: High-level takeaways for our main results using information diagrams. (a) We propose to learn minimal and sufficient self-supervision: minimize H(Z_X|S) to discard task-irrelevant information and maximize I(Z_X; S) to extract task-relevant information. (b) The resulting learned representation Z_X* contains all the task-relevant information from the input with a potential loss ε_info and discards task-irrelevant information with a fixed gap I(X; S|T). (c) Our core assumption: the self-supervised signal is approximately redundant to the input for the task-relevant information.

Here, we provide a table showing different X/S in various SSL frameworks. We note that not all SSL frameworks realize the inputs and the self-supervised signals as corresponding views. For instance, the Jigsaw puzzle (Noroozi & Favaro, 2016) considers (shuffled) image patches as the input and the positions of the patches as the self-supervised signals. Another example is learning by predicting rotations (Gidaris et al., 2018), which considers an image (rotated with a specific

