AUGMENTATION COMPONENT ANALYSIS: MODELING SIMILARITY VIA THE AUGMENTATION OVERLAPS

Abstract

Self-supervised learning aims to learn an embedding space where semantically similar samples are close. Contrastive learning methods pull views of the same sample together and push different samples apart, which exploits the semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. Therefore, we can take the augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution, and we call it the augmentation feature. The similarity between augmentation features reflects how much the views of two samples overlap and is related to their semantic similarity. To avoid the computational burden of explicitly estimating the values of the augmentation feature, we propose Augmentation Component Analysis (ACA), which uses a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA is equivalent to an efficient dimension reduction by PCA and extracts low-dimensional embeddings that theoretically preserve the similarity of augmentation distributions between samples. Empirical results show that our method achieves competitive results against various traditional contrastive learning methods on different benchmarks.

1. INTRODUCTION

The rapid development of contrastive learning has pushed self-supervised representation learning to unprecedented success. Many contrastive learning methods surpass traditional pretext-based methods by a large margin and even outperform representations learned by supervised learning (Wu et al., 2018; van den Oord et al., 2018; Tian et al., 2020a; He et al., 2020; Chen et al., 2020a; c). The key idea of self-supervised contrastive learning is to construct views of samples via modern data augmentations (Chen et al., 2020a). Discriminative embeddings are then learned by pulling together views of the same sample in the embedding space while pushing apart views of other samples. Contrastive learning methods utilize the semantic invariance between views of the same sample, but the semantic relationship between samples is ignored. Instead of measuring the similarity between particular augmented views of samples, we claim that the similarity between the augmentation distributions of samples better reveals the sample-wise similarity. In other words, semantically similar samples have similar sets of views. As shown in Figure 1 (left), two images of deer produce many similar crops, so the sets of their augmentation results, i.e., their augmentation distributions, overlap substantially. In contrast, a car image will rarely be augmented to the same crop as a deer, and their augmentation distributions overlap little. In Figure 1 (right), we verify this motivation numerically. We approximate the overlap between image augmentations with a classical image matching algorithm (Zitova & Flusser, 2003), which measures the proportion of key points matched between the raw images. We find that samples of the same class overlap more on average than samples of different classes, supporting our motivation. Therefore, we establish the semantic relationship between samples in an unsupervised manner based on the similarity of their augmentation distributions, i.e., how much they overlap.
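To make the notion of overlapping augmentation distributions concrete, the following is a minimal toy sketch, not the matching procedure used for Figure 1. It assumes one-dimensional integer "images", uses uniformly random fixed-length crops as the only augmentation, and measures overlap by histogram intersection; the helper names `crop_distribution` and `overlap` and the toy signals are hypothetical and chosen only for illustration.

```python
from collections import Counter


def crop_distribution(signal, crop_len):
    """Augmentation distribution of a toy sample (illustrative assumption).

    Every contiguous crop of length `crop_len` is treated as one view, and
    each start position is assumed to be equally likely.
    """
    crops = [tuple(signal[i:i + crop_len])
             for i in range(len(signal) - crop_len + 1)]
    counts = Counter(crops)
    total = len(crops)
    return {view: n / total for view, n in counts.items()}


def overlap(p, q):
    """Histogram intersection: probability mass shared by two distributions."""
    return sum(min(prob, q[view]) for view, prob in p.items() if view in q)


# Toy samples: two "deer" share a middle segment, the "car" shares nothing.
deer_a = [1, 2, 3, 4, 5, 6]
deer_b = [0, 2, 3, 4, 5, 7]
car = [9, 8, 7, 6, 5, 4]

p_deer_a = crop_distribution(deer_a, crop_len=3)
p_deer_b = crop_distribution(deer_b, crop_len=3)
p_car = crop_distribution(car, crop_len=3)

same_class_overlap = overlap(p_deer_a, p_deer_b)
diff_class_overlap = overlap(p_deer_a, p_car)
print(same_class_overlap, diff_class_overlap)
```

In this toy setting the two deer-like samples share half of their crops, whereas the car-like sample shares none with either, mirroring the qualitative pattern described above. Real images would require an approximate notion of view equality, which is why a key-point matching algorithm is used in practice.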
In this paper, we propose to describe data directly by their augmentation distributions. We call this kind of feature the augmentation feature. The elements of the augmentation feature represent

Availability

Code available at https://github.com/hanlu-nju

