SELF-SUPERVISED LEARNING OF MAXIMUM MANIFOLD CAPACITY REPRESENTATIONS

Abstract

Self-supervised learning (SSL) provides a strategy for constructing useful representations of images without relying on hand-assigned labels. Many such methods aim to learn a function that maps distinct views of the same scene or object to nearby points in the representation space. These methods are often justified by showing that they optimize an objective that is an approximation of (or correlated with) the mutual information between representations of different views. Here, we recast the problem from the perspective of manifold capacity, a measure that has been used to evaluate the classification capabilities of a representation. Specifically, we develop a contrastive learning framework that aims to maximize the number of linearly separable object manifolds, yielding a Maximum Manifold Capacity Representation (MMCR). We apply this method to unlabeled images, each augmented by a set of basic transformations, and find that it learns meaningful features under the standard linear evaluation protocol. Specifically, we find that MMCRs support object recognition performance comparable to or better than that of recently developed SSL frameworks, while providing greater robustness to adversarial attacks. Finally, empirical analysis reveals the means by which compression of object manifolds gives rise to class separability.

1. INTRODUCTION

Natural images lie, at least locally, within manifolds whose intrinsic dimensionality is low relative to that of their embedding space (the set of pixel intensities). Nevertheless, these manifolds are enormously complex, as evidenced by the variety of natural scenes. A fundamental goal of machine learning is to extract these structures from observations, and use them to perform inference tasks. In the context of recognition, consider the object submanifold, M j , which consists of all images of object j (for example, those taken from different camera locations, or under different lighting conditions). Object recognition networks act to map images within a submanifold to nearby representations, relative to images from other submanifolds, and this concept has been effectively exploited in recent self-supervised learning (SSL) methods (Zbontar et al., 2021; Chen et al., 2020; Caron et al., 2020; Bachman et al., 2019; Wang & Isola, 2020; Wang et al., 2022). Most of these operate by minimizing pairwise distances between images within submanifolds, while contrastively maximizing pairwise distances between images in different submanifolds. A parallel effort in computational neuroscience has aimed to characterize manifolds in neural representations, and their relationship to underlying neural circuits (Kriegeskorte & Kievit, 2013; Chung & Abbott, 2021). Studies in various modalities have identified geometric structures in neural data that are associated with behavioral tasks (Bernardi et al., 2020; DiCarlo & Cox, 2007; Hénaff et al., 2021; Gallego et al., 2017; Nieh et al., 2021), and explored metrics for quantifying these representation geometries. Here, we make use of a recently developed measure of manifold capacity, rooted in statistical physics (Chung et al., 2018), which has been used to evaluate how many manifolds can be linearly separated within the representation space of various models.
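To make the within-manifold compression and between-manifold separation concrete, the following is a minimal illustrative sketch of a manifold-based contrastive objective of this general flavor. The function name, the use of the nuclear norm as the measure of manifold extent, and the specific combination of terms are assumptions for illustration, not the exact formulation developed in this paper.

```python
import numpy as np

def manifold_contrastive_loss(views):
    """Illustrative manifold-style contrastive loss (not the paper's exact objective).

    views: array of shape (n_objects, n_views, d) holding L2-normalized
    embeddings of augmented views of each object.

    Returns the sum of per-object manifold extents (nuclear norms of the
    view matrices) minus the extent of the global manifold spanned by the
    per-object centroids. Minimizing this compresses each object manifold
    while spreading the object centroids apart.
    """
    # Per-object extent: nuclear norm of each (n_views x d) view matrix.
    local_extent = sum(np.linalg.norm(v, ord="nuc") for v in views)

    # Global extent: nuclear norm of the (n_objects x d) centroid matrix,
    # with centroids re-normalized to unit length.
    centroids = views.mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    global_extent = np.linalg.norm(centroids, ord="nuc")

    return local_extent - global_extent
```

In this sketch, a perfectly compressed representation (all views of an object mapped to a single point) with mutually orthogonal object centroids attains a low loss, matching the intuition that each object submanifold should collapse while the collection of objects spans as many directions as possible.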
We develop a simplified form of this measure, and incorporate it into a novel contrastive objective that maximizes the extent of the global image manifold while minimizing that of the constituent object manifolds. We apply this to an unlabeled set of images, each augmented to form a small set of samples from its corresponding manifold. We show that the learned representations

