AUGMENTATION COMPONENT ANALYSIS: MODELING SIMILARITY VIA THE AUGMENTATION OVERLAPS

Abstract

Self-supervised learning aims to learn an embedding space in which semantically similar samples lie close together. Contrastive learning methods pull views of the same sample together and push different samples apart, which exploits the semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to produce similar augmented views. We can therefore take the set of augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution, which we call the augmentation feature. Similarity in the augmentation feature reflects how much the views of two samples overlap and is related to their semantic similarity. To avoid the computational burden of explicitly estimating the values of the augmentation feature, we propose Augmentation Component Analysis (ACA), which uses a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA is equivalent to an efficient PCA-style dimension reduction and extracts low-dimensional embeddings that theoretically preserve the similarity of augmentation distributions between samples. Empirical results show that our method achieves competitive results against various traditional contrastive learning methods on different benchmarks.

1. INTRODUCTION

The rapid development of contrastive learning has pushed self-supervised representation learning to unprecedented success. Many contrastive learning methods surpass traditional pretext-based methods by a large margin and even outperform representations learned by supervised learning (Wu et al., 2018; van den Oord et al., 2018; Tian et al., 2020a; He et al., 2020; Chen et al., 2020a;c). The key idea of self-supervised contrastive learning is to construct views of samples via modern data augmentations (Chen et al., 2020a). Discriminative embeddings are then learned by pulling together views of the same sample in the embedding space while pushing apart views of other samples. Contrastive learning methods utilize the semantic invariance between views of the same sample, but the semantic relationship between samples is ignored.

Instead of measuring the similarity between particular augmented views of samples, we claim that the similarity between the augmentation distributions of samples reveals sample-wise similarity better. In other words, semantically similar samples have similar sets of views. As shown in Figure 1 left, two images of deer create many similar crops, and the sets of their augmentation results, i.e., their distributions, overlap substantially. In contrast, a car image will rarely be augmented to the same crop as a deer, and their augmentation distributions overlap little. In Figure 1 right, we verify this motivation numerically. We approximate the overlaps between image augmentations with a classical image matching algorithm (Zitova & Flusser, 2003), which counts the portion of key points matched in the raw images. We find that samples of the same class overlap more, on average, than samples of different classes, supporting our motivation. Therefore, we establish the semantic relationship between samples in an unsupervised manner based on the similarity of augmentation distributions, i.e., how much they overlap.
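The intuition above can be made concrete with a small sketch. The following toy example is not the paper's keypoint-matching procedure; it replaces images with short 1-D signals and augmentation with random cropping (all names and numbers are illustrative assumptions), and estimates the overlap between two samples' augmentation distributions by Monte-Carlo sampling of views and histogram intersection:

```python
import numpy as np
from collections import Counter

def augment(x, rng, n_views=2000, crop=8):
    """Augment a 1-D 'image' by random cropping.
    Each view is identified by the content of the cropped window."""
    starts = rng.integers(0, len(x) - crop + 1, size=n_views)
    return [tuple(x[s:s + crop]) for s in starts]

def overlap(views_a, views_b):
    """Histogram-intersection estimate of how much two augmentation
    distributions overlap: sum over views v of min(p_a(v), p_b(v))."""
    ca, cb = Counter(views_a), Counter(views_b)
    na, nb = len(views_a), len(views_b)
    return sum(min(ca[v] / na, cb[v] / nb) for v in ca.keys() & cb.keys())

rng = np.random.default_rng(0)
deer1 = np.array([0]*4 + [1]*8 + [0]*4)   # two similar 'deer' signals,
deer2 = np.array([0]*3 + [1]*8 + [0]*5)   # shifted by one position
car   = np.array([2]*8 + [3]*8)           # a dissimilar 'car' signal

v1, v2, v3 = (augment(s, rng) for s in (deer1, deer2, car))
print(overlap(v1, v2))  # large: the two 'deer' share many crops
print(overlap(v1, v3))  # zero: 'deer' and 'car' share no crop
```

Because the two "deer" signals differ only by a one-position shift, most of their crops coincide and the estimated overlap is high, while the "car" signal shares no crop with either, mirroring the class-dependent overlap pattern reported in Figure 1.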
In this paper, we propose to describe data directly by their augmentation distributions. We call a feature of this kind the augmentation feature. The elements of the augmentation feature represent the probability of obtaining a certain view by augmenting the sample, as shown on the left of Figure 2. The augmentation feature serves as an "ideal" representation, since it encodes the augmentation information without any loss and we can easily obtain the overlap between two samples from it. However, not only are its elements hard to calculate, but such high-dimensional embeddings are also impractical to use. Inspired by the classical strategy for dealing with high-dimensional data, we propose Augmentation Component Analysis (ACA), which employs the idea of PCA (Hotelling, 1933) to perform dimension reduction on the augmentation features mentioned above. ACA reformulates the extraction of principal components of the augmentation features as a contrastive-like loss. With the learned principal components, another on-the-fly loss embeds samples efficiently. ACA learns operable low-dimensional embeddings that theoretically preserve the distances between augmentation distributions. In addition, the similarity between the objectives of ACA and the traditional contrastive loss may explain why contrastive learning can learn semantically related embeddings: it embeds samples into spaces that partially preserve augmentation distributions. Experiments on synthetic and real-world datasets demonstrate that our ACA achieves competitive results against various traditional contrastive learning methods. Our contributions are as follows:
• We propose a new self-supervised strategy that measures sample-wise similarity via the similarity of augmentation distributions. This new perspective facilitates learning embeddings.
• We propose the ACA method, which implicitly performs dimension reduction over the augmentation feature; the learned embeddings preserve the augmentation similarity between samples.
• Benefiting from its resemblance to the contrastive loss, ACA helps explain the functionality of contrastive learning and why it can learn semantically meaningful embeddings.
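The idea of reducing the dimension of the augmentation feature can be illustrated on a toy problem where the feature is small enough to materialize. In the sketch below (all numbers are hypothetical, and the explicit SVD stands in for what ACA achieves implicitly via its contrastive-like loss), each row of a matrix holds one sample's augmentation distribution over a handful of discrete views, and projecting onto the top principal components yields low-dimensional embeddings whose distances reflect the overlap of those distributions:

```python
import numpy as np

# Toy world: 4 samples, 6 possible discrete views.
# Row i of A is sample i's augmentation distribution P(view | sample),
# i.e., its 'augmentation feature'; the numbers are illustrative.
A = np.array([
    [.4, .4, .2, .0, .0, .0],   # sample 0 ('deer')
    [.3, .4, .3, .0, .0, .0],   # sample 1 ('deer'): overlaps sample 0 a lot
    [.0, .0, .1, .3, .3, .3],   # sample 2 ('car')
    [.0, .0, .0, .3, .4, .3],   # sample 3 ('car'): overlaps sample 2 a lot
])

# PCA-style dimension reduction on the augmentation feature: keep the
# top-k right singular vectors as principal components and project each
# row onto them. ACA learns an equivalent projection without ever
# materializing A, which is intractable for real augmentations.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Z = A @ Vt[:k].T          # low-dimensional embeddings, one row per sample

d = lambda i, j: np.linalg.norm(Z[i] - Z[j])
print(d(0, 1), d(2, 3))   # small: same-class pairs
print(d(0, 2))            # large: cross-class pair
```

Even after reducing six dimensions to two, same-class pairs (whose augmentation distributions overlap heavily) stay close while cross-class pairs stay far apart, which is the property the learned embeddings are meant to preserve.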

2. RELATED WORK

Self-Supervised Learning. Learning effective visual representations without human supervision is a long-standing problem. Self-supervised learning methods address it by creating supervision from the data itself instead of from human labelers. The model solves a pretext task before it is used for downstream tasks. For example, in computer vision, pretext tasks include colorizing grayscale images (Zhang et al., 2016), inpainting images (Pathak et al., 2016), predicting relative patch positions (Doersch et al., 2015), solving jigsaw puzzles (Noroozi & Favaro, 2016), predicting rotations (Gidaris et al., 2018), and exploiting generative models (Goodfellow et al., 2014; Kingma & Welling, 2014; Donahue & Simonyan, 2019). Self-supervised learning has also achieved great success in natural language processing (Mikolov et al., 2013; Devlin et al., 2019).

Contrastive Learning and Non-Contrastive Methods. Contrastive approaches have been among the most prominent representation learning strategies in self-supervised learning. Similar to metric learning in supervised scenarios (Ye et al., 2019; 2020), these approaches maximize the agreement between positive pairs and minimize the agreement between negative pairs. Positive pairs are commonly constructed by co-occurrence (van den Oord et al., 2018; Tian et al., 2020a; Bachman et al., 2019) or by augmenting the same sample (He et al., 2020; Chen et al., 2020a;c; Li et al., 2021; Ye et al., 2023), while all other samples are taken as negatives. Most of these methods employ the InfoNCE loss (van den Oord et al., 2018), which acts as a lower bound of the mutual information between



Figure 1: Left: semantically similar samples (e.g., those in the same class) usually create similar augmentations. Right: images of the same class have a higher averaged augmentation overlap than images from different classes on four common datasets. For this reason, we learn embeddings by preserving the similarity between augmentation distributions of samples.

Code available at https://github.com/hanlu-nju

