A THEORETICAL STUDY OF INDUCTIVE BIASES IN CONTRASTIVE LEARNING

Abstract

Understanding self-supervised learning is important but challenging. Previous theoretical works study the role of pretraining losses and view neural networks as general black boxes. However, the recent work of Saunshi et al. (2022) argues that the model architecture, a component largely ignored by previous works, also has a significant influence on the downstream performance of self-supervised learning. In this work, we provide the first theoretical analysis of self-supervised learning that incorporates the effect of inductive biases originating from the model class. In particular, we focus on contrastive learning, a popular self-supervised learning method that is widely used in the vision domain. We show that when the model has limited capacity, contrastive representations recover certain special clustering structures that are compatible with the model architecture, but ignore many other clustering structures in the data distribution. As a result, our theory can capture the more realistic setting where contrastive representations have much lower dimensionality than the number of clusters in the data distribution. We instantiate our theory on several synthetic data distributions, and provide empirical evidence to support the theory.

1. INTRODUCTION

Recent years have witnessed the effectiveness of pre-trained representations, which are learned on unlabeled data with self-supervised losses and then adapted to a wide range of downstream tasks (Chen et al., 2020a;b; He et al., 2020; Caron et al., 2020; Chen et al., 2020c; Gao et al., 2021; Su et al., 2021; Chen & He, 2020; Brown et al., 2020; Radford et al., 2019). However, understanding the empirical success of this emergent pre-training paradigm is still challenging. It requires novel mathematical frameworks and analyses beyond classical statistical learning theory. The prevalent use of deep neural networks in self-supervised learning also adds to the mystery. Many theoretical works focus on isolating the role of self-supervised losses, showing that they encourage the representations to capture certain structures of the unlabeled data that are helpful for downstream tasks (Arora et al., 2019; HaoChen et al., 2021; 2022; Wei et al., 2021; Xie et al., 2021; Saunshi et al., 2020). However, these works oftentimes operate in the regime of sufficient pre-training data (polynomial in the dimensionality) or even infinite pre-training data, and view the neural network as a black box. The only relevant property of neural networks in these works is that they form a parameterized model class with a finite complexity measure (e.g., Rademacher complexity). Recently, Saunshi et al. (2022) argued that the pre-training loss is not the only contributor to the performance of self-supervised learning, and that previous works which view neural networks as a black box cannot tell apart the differences in downstream performance between architectures (e.g., ResNet (He et al., 2015) vs. vision transformers (Dosovitskiy et al., 2020)). Furthermore, self-supervised learning with an appropriate architecture can possibly work under more general conditions and/or with less pre-training data than predicted by these results on general architectures.
Therefore, a more comprehensive and realistic theory needs to take into consideration the inductive biases of the architecture. This paper provides the first theoretical analysis of the inductive biases of nonlinear architectures in self-supervised learning. Our theory follows the setup of the recent works of HaoChen et al. (2021; 2022) on contrastive learning, and can be seen as a refinement of their results by further characterizing the model architecture's impact on the learned representations.
We recall that HaoChen et al. (2021) show that contrastive learning, with sufficient data and a parameterized model class of finite complexity, is equivalent to spectral clustering on a so-called population positive-pair graph, where nodes are augmented images and the edge between nodes x and x′ is weighted according to the probability of encountering (x, x′) as a positive pair. They essentially assume that the positive-pair graph contains several major semantically-meaningful clusters, and prove that contrastive representations exhibit a corresponding clustering structure in Euclidean space, that is, images with relatively small graph distance have nearby representations. Their results rely heavily on the clustering property of the graph: the representation dimensionality and pre-training sample complexity both scale with the number of clusters. The important recent work of Saunshi et al. (2022), however, demonstrates with a synthetic setting that contrastive learning can provably work with linear model architectures even if the number of clusters is huge (e.g., exponential in the dimensionality). Beyond the simple synthetic example discussed in their paper, there has been no previous work that formally characterizes this effect in a general setting. In this work, we develop a general theory that leverages the inductive bias to avoid the dependency on the potentially huge number of clusters: although there exists a large number of clusters in the positive-pair graph, the number of clusters implementable by the model (which we call minimal implementable clusters) could be much smaller, even exponentially so.
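The spectral-clustering view above can be illustrated numerically. The sketch below builds a small hypothetical positive-pair graph with two disconnected clusters (the adjacency weights are made up for illustration, not taken from any real augmentation distribution), and takes the top eigenvectors of the symmetrically normalized adjacency matrix as the representation, following the spectral contrastive learning recipe of HaoChen et al. (2021) in spirit:

```python
import numpy as np

# Toy positive-pair graph: 6 augmented data points forming two clusters
# (hypothetical symmetric edge weights; nonzero weight = positive pair).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

deg = A.sum(axis=1)
# Symmetrically normalized adjacency matrix D^{-1/2} A D^{-1/2}.
A_norm = A / np.sqrt(np.outer(deg, deg))

# Eigenvectors for the k largest eigenvalues serve as a k-dimensional
# representation (spectral clustering / spectral embedding of the graph).
eigvals, eigvecs = np.linalg.eigh(A_norm)  # eigenvalues in ascending order
k = 2
rep = eigvecs[:, -k:]

# Nodes in the same cluster receive identical representations, while the
# two clusters are mapped to well-separated points.
print(np.round(rep, 3))
```

Because the two clusters are disconnected components here, the top eigenspace is spanned by (scaled) cluster indicators, so within-cluster representations coincide exactly; with weak cross-cluster edges they would instead be approximately equal, which is the regime the theory addresses.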
Figure 1 shows an example where a linear function can implement one clustering structure but not the other, despite both being valid clusters in the positive-pair graph. It is possible that a minimal implementable cluster consists of multiple well-separated sub-clusters, none of which can be implemented by the model class on its own.



Figure 1: A simple example where the linear function class learns the correct feature and ignores the spurious feature; a simplified version of the synthetic example proposed by Saunshi et al. (2022). The orange points are the original data and the blue points are augmented data (obtained by adding noise in the spurious dimension). The dimension invariant to augmentation is the desired one. Edges represent positive pairs that are constructed from augmentation. We say a real-valued function implements a cluster if it outputs 1 on the cluster and outputs 0 on all other data. We note that implementing here means matching the exact value, rather than simply matching the label after applying some linear threshold. The figure shows two possible ways to partition the data into two clusters, but only the one on the left-hand side (which captures the invariant dimension) is implementable by a linear function. Here we use black numbers to indicate the target output on the data, and green numbers to indicate the output of the implementing function, which extrapolates outside of the data support. Note that the linear model is not composed with a threshold function. The partition on the right-hand side is not implementable because any linear model that outputs a constant 1 on the upper-left small cluster would also output 1 on the bottom-left small cluster due to linear extrapolation. Here we use red numbers to indicate the output of the linear function that contradicts the target.
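The implementability argument in the caption can be checked with a small least-squares computation. The sketch below uses hypothetical coordinates in the spirit of Figure 1 (the exact point positions are made up): each small cluster is a short segment along the spurious dimension, and a partition is linearly implementable exactly when the best linear fit to its 0/1 target has zero error.

```python
import numpy as np

# Points are (invariant, spurious). Each small cluster is a segment along
# the spurious dimension, mimicking augmentation noise (hypothetical values).
def cluster(inv, spur_lo):
    return [(inv, spur_lo + t) for t in (0.0, 0.5, 1.0)]

upper_left  = cluster(0.0, 3.0)
lower_left  = cluster(0.0, 0.0)
upper_right = cluster(1.0, 3.0)
lower_right = cluster(1.0, 0.0)

X = np.array(upper_left + lower_left + upper_right + lower_right)
X1 = np.hstack([X, np.ones((len(X), 1))])  # append a bias feature

def exact_linear_fit_error(y):
    # Best linear function f(x) = w . x + b in least squares; zero error
    # means the 0/1 target is exactly implementable by a linear model.
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.max(np.abs(X1 @ w - y))

# Left partition of Figure 1: split along the invariant dimension.
y_invariant = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
# Right partition of Figure 1: split along the spurious dimension.
y_spurious  = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)

print(exact_linear_fit_error(y_invariant))  # essentially zero: implementable
print(exact_linear_fit_error(y_spurious))   # bounded away from zero: not implementable
```

Intuitively, a linear function that is constant on a segment spanning the spurious dimension must put zero weight on that dimension, so it cannot separate two segments that differ only in their spurious coordinate, which is exactly the extrapolation argument in the caption.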

