RETHINKING POSITIVE SAMPLING FOR CONTRASTIVE LEARNING WITH KERNEL

Anonymous authors
Paper under double-blind review

Abstract

Data augmentation is a crucial component in unsupervised contrastive learning (CL). It determines how positive samples are defined and, ultimately, the quality of the learnt representation. Although efficient augmentations have been found for ImageNet, CL still underperforms compared to supervised methods, and finding adequate augmentations remains an open problem in other applications, such as medical imaging, or in datasets with easy-to-learn but irrelevant imaging features. In this work, we propose a new way to define positive samples using kernel theory, along with a novel loss called decoupled uniformity. We integrate prior information, either learnt from generative models viewed as feature extractors or given as auxiliary attributes, into contrastive learning, making it less dependent on data augmentation. We draw a connection between contrastive learning and conditional mean embedding theory to derive tight bounds on the downstream classification loss. In the unsupervised setting, we empirically demonstrate that CL benefits from generative models, such as VAEs and GANs, to rely less on data augmentations. We validate our framework on vision and medical datasets, including CIFAR10, CIFAR100, STL10, ImageNet100, CheXpert and a brain MRI dataset. In the weakly supervised setting, we demonstrate that our formulation provides state-of-the-art results.

1. INTRODUCTION

Figure 1: Illustration of the proposed method. Each point is an original image x. Two points are connected if they can be transformed into the same augmented image using a distribution of augmentations A. Colors represent semantic (unknown) classes and light disks represent the support of augmentations A(·|x) for each sample x. Starting from an incomplete augmentation graph (1), where intra-class samples are not connected (e.g. because augmentations are insufficient or not adapted), we reconnect them using a kernel defined on prior information (either learnt with a generative model, viewed as a feature extractor, or given as auxiliary attributes). The extended augmentation graph (3) is the union of the (incomplete) augmentation graph (1) and the kernel graph (2). In (2), the gray disk indicates the set of points x′ that are close to the anchor (blue star) in the kernel space.

Contrastive Learning (CL) (44; 3; 4; 7; 10) is a paradigm designed for representation learning which has been applied to unsupervised (10; 13), weakly supervised (55; 20) and supervised problems (37). It gained popularity in recent years by achieving impressive results in the unsupervised setting on standard vision datasets (e.g. ImageNet), where it almost matches the performance of its supervised counterpart (10; 29). The objective in CL is to increase the similarity in the representation space between positive samples (semantically close), while decreasing the similarity between negative samples (semantically distinct). Despite its simple formulation, it requires the definition of a similarity function (which can be seen as an energy term (42)) and of a rule to decide whether a sample should be considered positive or negative. Similarity functions, such as the Euclidean scalar product (e.g. InfoNCE (44)), take as input the latent representations of an encoder f ∈ F, such as a CNN (11) or a Transformer (9) for vision datasets.
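To make the objective concrete, the InfoNCE-style loss described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the default temperature and the one-directional formulation (positives on the diagonal, all other in-batch pairs as negatives) are our own simplifying choices.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """Minimal InfoNCE loss for a batch of positive pairs.

    z1, z2: (N, d) arrays of L2-normalized embeddings of two views.
    Row i of z1 and row i of z2 form a positive pair; every other
    row of z2 acts as a negative for row i of z1.
    """
    # Scaled cosine similarities between every view-1 / view-2 pair.
    logits = z1 @ z2.T / temperature                       # (N, N)
    # Log-softmax over each row; positive pairs lie on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy pulling each diagonal (positive) entry up.
    return -np.mean(np.diag(log_prob))
```

Aligning a pair (large diagonal similarity) lowers the loss, while similar negatives raise it; full implementations such as SimCLR's NT-Xent additionally symmetrize the loss over both views.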
In supervised learning (37), positives are simply images belonging to the same class, while negatives are images belonging to different classes. In unsupervised learning (10), since labels are unknown, positives are usually defined as transformed versions (views) of the same original image (a.k.a. the anchor) and negatives are the transformed versions of all other images. As a result, the augmentation distribution A used to sample both positives and negatives is crucial (10) and conditions the quality of the learnt representation. The most-used augmentations for visual representations involve aggressive crop and color distortion. Cropping induces representations with high occlusion invariance (46), while color distortion may prevent the encoder f from taking a shortcut (10) when aligning positive-sample representations and from falling into the simplicity bias (51). Nevertheless, learning a representation that mainly relies on augmentations comes at a cost: both crop and color distortion induce strong biases in the final representation (46). Specifically, dominant objects inside images can prevent the model from learning features of smaller objects (12) (which is not apparent in object-centric datasets such as ImageNet), and a few irrelevant, easy-to-learn features that are shared among views are sufficient to collapse the representation (12) (a.k.a. feature suppression). Finding the right augmentations in other visual domains, such as medical imaging, remains an open challenge (20), since we need to find transformations that preserve semantic anatomical structures (e.g. discriminative between pathological and healthy) while removing unwanted noise. If the augmentations are too weak or inadequate to remove signal that is irrelevant w.r.t. a discrimination task, then how can we define positive samples?
In our work, we propose to integrate prior information, learnt from generative models or given as auxiliary attributes, into contrastive learning, to make it less dependent on data augmentation. Building on the theoretical understanding of CL through the augmentation graph, we make a connection with kernel theory and introduce a novel loss with theoretical guarantees on downstream performance. Prior information is integrated into the proposed contrastive loss through a kernel. In the unsupervised setting, we leverage pre-trained generative models, such as GANs (24) and VAEs (38), to learn a prior representation of the data. We provide a solution to the feature suppression issue in CL (12) and also demonstrate SOTA results with weaker augmentations on visual benchmarks. In visual domains where data augmentations are not adapted to the downstream task (e.g. medical imaging), we show that we can improve CL, alleviating the need to find efficient augmentations. In the weakly supervised setting, we instead use auxiliary/prior information, such as image attributes (e.g. bird color or size), and we show better performance than previous conditional formulations based on these attributes (55). In summary, we make the following contributions:

1. We propose a new framework for contrastive learning allowing the integration of prior information, learnt from generative models or given as auxiliary attributes, into the positive sampling.
2. We derive theoretical bounds on the downstream classification risk that rely on weaker assumptions on data augmentations than previous works on CL.
3. We empirically show that our framework can benefit from the latest advances in generative models to learn a better representation while relying on fewer augmentations.
4. We show that we achieve SOTA results in the unsupervised and weakly supervised settings.
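As a toy illustration of the idea of defining positive relations through a kernel on prior information (generative-model codes or auxiliary attributes), the sketch below builds a row-normalized RBF kernel matrix over prior representations. The RBF choice, the function names and the gamma parameter are our own illustrative assumptions; the paper's actual loss and kernel are defined in the method section, not here.

```python
import numpy as np

def rbf_kernel(priors, gamma=1.0):
    """RBF kernel matrix on prior representations (e.g. VAE/GAN codes
    or auxiliary attributes): K[i, j] is large when samples i and j
    are close in the prior space."""
    sq = np.sum(priors ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * priors @ priors.T
    return np.exp(-gamma * d2)

def kernel_positive_weights(priors, gamma=1.0):
    """Row-normalized kernel: each row gives weights summing to 1 that
    can soften the hard positive/negative split when weighting pairs."""
    K = rbf_kernel(priors, gamma)
    np.fill_diagonal(K, 0.0)          # exclude trivial self-pairs
    return K / K.sum(axis=1, keepdims=True)
```

Samples that are close in the prior space receive a high positive weight even when no augmentation connects them, which is precisely the "reconnection" of intra-class samples depicted in Figure 1.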

2. RELATED WORKS

In the weakly supervised setting, recent studies (20; 55) have shown that positive samples can be defined conditionally on an auxiliary attribute in order to improve the final representation, in particular for medical imaging (20). From an information bottleneck perspective, these approaches essentially

