SELF-SUPERVISED LEARNING OF MAXIMUM MANIFOLD CAPACITY REPRESENTATIONS

Abstract

Self-supervised learning (SSL) provides a strategy for constructing useful representations of images without relying on hand-assigned labels. Many such methods aim to learn a function that maps distinct views of the same scene or object to nearby points in the representation space. These methods are often justified by showing that they optimize an objective that is an approximation of (or correlated with) the mutual information between representations of different views. Here, we recast the problem from the perspective of manifold capacity, a measure that has been used to evaluate the classification capabilities of a representation. Specifically, we develop a contrastive learning framework that aims to maximize the number of linearly separable object manifolds, yielding a Maximum Manifold Capacity Representation (MMCR). We apply this method to unlabeled images, each augmented by a set of basic transformations, and find that it learns meaningful features under the standard linear evaluation protocol. Specifically, we find that MMCRs support object recognition performance comparable to, or better than, that of recently developed SSL frameworks, while providing greater robustness to adversarial attacks. Finally, empirical analysis reveals the means by which compression of object manifolds gives rise to class separability.

1. INTRODUCTION

Natural images lie, at least locally, within manifolds whose intrinsic dimensionality is low relative to that of their embedding space (the set of pixel intensities). Nevertheless, these manifolds are enormously complex, as evidenced by the variety of natural scenes. A fundamental goal of machine learning is to extract these structures from observations, and use them to perform inference tasks. In the context of recognition, consider the object submanifold, M_j, which consists of all images of object j (for example, those taken from different camera locations, or under different lighting conditions). Object recognition networks act to map images within a submanifold to nearby representations, relative to images from other submanifolds, and this concept has been effectively exploited in recent self-supervised learning (SSL) methods (Zbontar et al., 2021; Chen et al., 2020; Caron et al., 2020; Bachman et al., 2019; Wang & Isola, 2020; Wang et al., 2022). Most of these operate by minimizing pairwise distances between images within submanifolds, while contrastively maximizing pairwise distances between images in different submanifolds. A parallel effort in computational neuroscience has aimed to characterize manifolds in neural representations, and their relationship to underlying neural circuits (Kriegeskorte & Kievit, 2013; Chung & Abbott, 2021). Studies in various modalities have identified geometric structures in neural data that are associated with behavioral tasks (Bernardi et al., 2020; DiCarlo & Cox, 2007; Hénaff et al., 2021; Gallego et al., 2017; Nieh et al., 2021), and explored metrics for quantifying these representation geometries. Here, we make use of a recently developed measure of manifold capacity, rooted in statistical physics (Chung et al., 2018), which has been used to evaluate how many manifolds can be linearly separated within the representation space of various models.
We develop a simplified form of this measure, and incorporate it into a novel contrastive objective that maximizes the extent of the global image manifold while minimizing that of the constituent object manifolds. We apply this to an unlabeled set of images, each augmented to form a small set of samples from its corresponding manifold. We show that the learned representations:
• support high-quality object recognition, when evaluated using the standard linear evaluation paradigm (Chen et al., 2020), i.e., training a linear classifier to operate on the output of the unsupervised network. In particular, performance approximately matches that of other recently proposed SSL methods;
• extract semantically relevant features from the data, which can be revealed by examining the learning signal derived from the unsupervised task;
• have interpretable geometric properties;
• are more robust to adversarial attack than those of other recently proposed SSL methods.
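To make the objective concrete, the sketch below illustrates a manifold-capacity-inspired contrastive loss of the kind described above: per-image "object manifolds" (sets of augmented views) are compressed by shrinking their nuclear norms, while the global manifold, summarized here by the matrix of per-image centroids, is expanded. This is a minimal numpy illustration under our own assumptions (shapes, weighting, and the centroid summary are choices made for this sketch, not necessarily those of the paper).

```python
import numpy as np

def nuclear_norm(M):
    # Sum of singular values: jointly sensitive to both the extent and the
    # effective dimensionality of the point cloud whose rows form M.
    return np.linalg.svd(M, compute_uv=False).sum()

def mmcr_style_loss(Z):
    """Hypothetical sketch of a manifold-capacity-inspired objective.

    Z has shape (B, K, D): B images, K augmented views each, D-dimensional
    unit-norm embeddings. Minimizing this loss compresses each per-image
    (object) manifold while expanding the nuclear norm of the matrix of
    manifold centroids (a proxy for the global image manifold). The exact
    weighting and normalization in the paper may differ.
    """
    B, K, D = Z.shape
    centroids = Z.mean(axis=1)                                  # (B, D)
    per_object = sum(nuclear_norm(Z[i]) for i in range(B)) / B  # compress
    global_extent = nuclear_norm(centroids)                     # expand
    return per_object - global_extent
```

With unit-norm embeddings, a batch whose views are tightly clustered per image (small object manifolds) and whose centroids remain spread out attains a lower loss than one with diffuse views, which is the intended behavior.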

1.1. RELATED WORK

Our methodology is closely related to, and inspired by, recent advances in contrastive self-supervised representation learning (SSL), but has a distinctly different motivation. Many recent frameworks craft objectives designed to maximize the mutual information between representations of different views of the same object (Oord et al., 2018; Chen et al., 2020; Tian et al., 2020; Bachman et al., 2019). However, estimating mutual information in high-dimensional feature spaces (the regime of modern deep learning models) has historically been difficult (Belghazi et al., 2018), and, furthermore, it is not clear that more closely approximating mutual information in the objective produces improved representations (Wang & Isola, 2020). By contrast, capacity estimation theories operate in the regime of large ambient dimension, as they are derived in the "large N (thermodynamic) limit" (Chung et al., 2018; Bahri et al., 2020). We therefore test whether one such measure, which until now has been used only to evaluate the quality of representations, can serve as an objective function for SSL. Operationally, many existing methods minimize some notion of distance between the representations of different augmented views of the same image, while maximizing the distance between representations of (augmented views of) distinct images; these terms can be thought of as encouraging alignment and uniformity, in the framework of Wang & Isola (2020). If one takes the view that the different views of an image form a continuous manifold to be compressed, the distance between two randomly sampled points from that manifold is a strange choice of size metric to optimize. Perhaps unsurprisingly, the benefit of using more than two views has been demonstrated on multiple occasions, notably by the success of the "multi-crop" strategy implemented in SwAV (Caron et al., 2020), and earlier in the contrastive multiview coding work of Tian et al. (2020).
Most commonly, however, multiple views are used such that the objective effectively becomes a Monte Carlo estimate of the same pairwise distance function, computed from more than one sample. Rather than using the mean distance or cosine similarity between pairs of points, we use a nuclear norm as a combined measure of the size and dimensionality of a group of points, an idea that is strongly motivated by learning theory. The nuclear norm has previously been used to induce or infer low-rank structure in the representation of data, for example, in Hénaff et al. (2015); Wang et al. (2022); Lezama et al. (2018). In particular, Wang et al. (2022) employ the nuclear norm as a regularizer to supplement an InfoNCE loss. Our approach represents a more radical departure from the traditional InfoNCE loss, as we will detail below. Rather than pair a low-rank prior with a logistic regression-based likelihood, we make the more symmetric choice of employing a high-rank likelihood. This allows the objective to explicitly discourage dimensional collapse, a well-known issue in SSL (Jing et al., 2021). Another consequence of encouraging maximal rank over the dataset is that the objective encourages the representation to form a simplex equiangular tight frame (sETF). sETFs have been shown to be optimal in terms of cross-entropy loss when features lie on the unit hypersphere (Lu & Steinerberger, 2020), and such representations can be obtained in the supervised setting when optimizing either the traditional cross-entropy loss or a supervised contrastive loss (Papyan et al., 2020; Graf et al., 2021). Recent work has shown that many popular objectives in SSL can be understood as different methods of approximating a loss function whose minima form sETFs (Dubois et al., 2022). Our approach is novel, in that it encourages sETF representations by directly optimizing the distribution of singular values, rather than minimizing a cross-entropy loss.
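The claim that the nuclear norm measures size and dimensionality jointly can be seen in a toy example (a numpy illustration of our own construction, not from the paper): for two point clouds with identical total energy (Frobenius norm), the nuclear norm grows with the effective rank of the cloud, so a representation collapsed onto a single direction scores lower than one spread across many directions.

```python
import numpy as np

def nuclear_norm(M):
    # Sum of singular values of the matrix of point coordinates.
    return np.linalg.svd(M, compute_uv=False).sum()

# Two toy point clouds: 8 points in 8 dimensions.
collapsed = np.outer(np.ones(8), np.eye(8)[0])  # rank 1: all points along one axis
spread = np.eye(8)                              # rank 8: one point per axis

# Normalize both to unit Frobenius norm so only the *shape* differs.
collapsed = collapsed / np.linalg.norm(collapsed)
spread = spread / np.linalg.norm(spread)

# Same total energy, but the higher-rank cloud has a larger nuclear norm:
# 1.0 for the rank-1 cloud versus sqrt(8) ~ 2.83 for the rank-8 cloud.
```

Because equal singular values maximize the nuclear norm at fixed Frobenius norm, maximizing it over the dataset pushes the spectrum toward uniformity, which is how the objective discourages dimensional collapse.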



Barlow Twins (Zbontar et al., 2021) notably avoids the curse of dimensionality because its objective effectively estimates information under a Gaussian parameterization, rather than doing so non-parametrically as in the InfoNCE loss. Our method also makes use of Gaussian/second-order parameterizations, as detailed below.

