AUGMENTATION COMPONENT ANALYSIS: MODELING SIMILARITY VIA THE AUGMENTATION OVERLAPS

Abstract

Self-supervised learning aims to learn an embedding space where semantically similar samples are close. Contrastive learning methods pull views of samples together and push different samples away, which utilizes the semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. Therefore, we can take the augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution, which we call the augmentation feature. The similarity in augmentation feature reflects how much the views of two samples overlap and is related to their semantic similarity. To avoid the computational burden of explicitly estimating the values of the augmentation feature, we propose Augmentation Component Analysis (ACA) with a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA is equivalent to an efficient dimension reduction by PCA and extracts low-dimensional embeddings, theoretically preserving the similarity of augmentation distributions between samples. Empirical results show that our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.

1. INTRODUCTION

The rapid development of contrastive learning has pushed self-supervised representation learning to unprecedented success. Many contrastive learning methods surpass traditional pretext-based methods by a large margin and even outperform representations learned by supervised learning (Wu et al., 2018; van den Oord et al., 2018; Tian et al., 2020a; He et al., 2020; Chen et al., 2020a; c). The key idea of self-supervised contrastive learning is to construct views of samples via modern data augmentations (Chen et al., 2020a). Then discriminative embeddings are learned by pulling together views of the same sample in the embedding space while pushing apart views of others.

Contrastive learning methods utilize the semantic invariance between views of the same sample, but the semantic relationship between samples is ignored. Instead of measuring the similarity between certain augmented views of samples, we claim that the similarity between the augmentation distributions of samples can reveal the sample-wise similarity better. In other words, semantically similar samples have similar sets of views. As shown in Figure 1 (left), two images of deer create many similar crops, and the sets of their augmentation results, i.e., their distributions, overlap substantially. In contrast, a car image will rarely be augmented to the same crop as a deer, and their augmentation distributions overlap little. In Figure 1 (right), we verify the motivation numerically. We approximate the overlaps between image augmentations with a classical image matching algorithm (Zitova & Flusser, 2003), which counts the portion of the key points matched in the raw images. We find that samples of the same class overlap more than samples of different classes on average, supporting our motivation. Therefore, we establish the semantic relationship between samples in an unsupervised manner based on the similarity of augmentation distributions, i.e., how much they overlap.
In this paper, we propose to describe data directly by their augmentation distributions. We call a feature of this kind the augmentation feature. The elements of the augmentation feature represent the probability of getting a certain view by augmenting the sample, as shown in the left of Figure 2. The augmentation feature serves as an "ideal" representation since it encodes the augmentation information without any loss and we can easily obtain the overlap of two samples from it. However, not only are its elements hard to calculate, but such high-dimensional embeddings are also impractical to use.

Inspired by the classical strategy to deal with high-dimensional data, we propose Augmentation Component Analysis (ACA), which employs the idea of PCA (Hotelling, 1933) to perform dimension reduction on the augmentation features mentioned above. ACA reformulates the steps of extracting principal components of the augmentation features with a contrastive-like loss. With the learned principal components, another on-the-fly loss embeds samples effectively. ACA learns operable low-dimensional embeddings that theoretically preserve the augmentation distribution distances. In addition, the similarity between the objectives of ACA and the traditional contrastive loss may explain why contrastive learning can learn semantic-related embeddings: they embed samples into spaces that partially preserve augmentation distributions. Experiments on synthetic and real-world datasets demonstrate that our ACA achieves competitive results against various traditional contrastive learning methods. Our contributions are as follows:

• We propose a new self-supervised strategy, which measures sample-wise similarity via the similarity of augmentation distributions. This new aspect facilitates learning embeddings.
• We propose the ACA method, which implicitly performs dimension reduction over the augmentation feature, and the learned embeddings preserve augmentation similarity between samples.
• Benefiting from the resemblance to contrastive loss, our ACA helps explain the functionality of contrastive learning methods and why they can learn semantically meaningful embeddings.

2. RELATED WORK

Self-Supervised Learning. Learning effective visual representations without human supervision is a long-standing problem. Self-supervised learning methods solve this problem by creating supervision from the data itself instead of human labelers. The model needs to solve a pretext task before it is used for the downstream tasks. For example, in computer vision, the pretext tasks include colorizing grayscale images (Zhang et al., 2016) , inpainting images (Pathak et al., 2016) , predicting relative patch (Doersch et al., 2015) , solving jigsaw puzzles (Noroozi & Favaro, 2016) , predicting rotations (Gidaris et al., 2018) and exploiting generative models (Goodfellow et al., 2014; Kingma & Welling, 2014; Donahue & Simonyan, 2019) . Self-supervised learning also achieves great success in natural language processing (Mikolov et al., 2013; Devlin et al., 2019) . Contrastive Learning and Non-Contrastive Methods. Contrastive approaches have been one of the most prominent representation learning strategies in self-supervised learning. Similar to the metric learning in supervised scenarios (Ye et al., 2019; 2020) , these approaches maximize the agreement between positive pairs and minimize the agreement between negative pairs. Positive pairs are commonly constructed by co-occurrence (van den Oord et al., 2018; Tian et al., 2020a; Bachman et al., 2019) or augmentation of the same sample (He et al., 2020; Chen et al., 2020a; c; Li et al., 2021; Ye et al., 2023) , while all the other samples are taken as negatives. Most of these methods employ the InfoNCE loss (van den Oord et al., 2018) , which acts as a lower bound of mutual information between views. Based on this idea, there are several methods that attempt to improve contrastive learning, including mining nearest neighbour (Dwibedi et al., 2021; ?; Azabou et al., 2021) and creating extra views by mixing up (Kalantidis et al., 2020) or adversarial training (Hu et al., 2021) . 
Another stream of methods employs a similar idea of contrastive learning to pull views of a sample together without using negative samples (Grill et al., 2020; Chen & He, 2021). Barlow Twins (Zbontar et al., 2021) minimizes the redundancy within the representation vector. Tsai et al. (2021) reveal the relationship among Barlow Twins, contrastive and non-contrastive methods. Most of these methods only utilize the semantic invariance of augmentation and ignore the relationship between samples. Different from them, we propose a new way to perform self-supervised learning by preserving the similarity of augmentation distributions, based on the observation that a strong correlation exists between the similarity of augmentation distributions and the similarity of semantics.

Explanation of Contrastive Learning. Several works provide empirical or theoretical results for explaining the behavior of contrastive learning. Tian et al. (2020b) and Xiao et al. (2021) explore the role of augmentation and show that a contrastive model can extract useful information from views but can also be affected by nuisance information. Zhao et al. (2021) empirically show that contrastive learning preserves low-level or middle-level instance information. In theoretical studies, Saunshi et al. (2019) provide guarantees for downstream linear classification tasks under a conditional independence assumption. Other works weaken the assumption, but their assumptions are still unrealistic (Lee et al., 2021; Tosh et al., 2021). HaoChen et al. (2021) focus on how views of different samples are connected by the augmentation process and provide guarantees with certain connectivity assumptions. Wang et al. (2022) notice that the augmentation overlap provides a ladder for gradually learning class-separated representations. In addition to the alignment and uniformity as shown by Wang & Isola (2020), Huang et al. (2021) develop theories on the crucial effect of data augmentation on the generalization of contrastive learning.
Hu et al. (2022) explain that the contrastive loss is implicitly doing SNE with "positive" pairs constructed from data augmentation. Inspired by the important role of augmentation, we provide a novel self-supervised method that ensures preserving augmentation overlap.

3. NOTATIONS

The set of all natural data (data without augmentation) is denoted by $\bar{\mathcal{X}}$, with size $|\bar{\mathcal{X}}| = N$. We assume that the natural data follow a uniform distribution $p(\bar{x})$ on $\bar{\mathcal{X}}$, i.e., $p(\bar{x}) = \frac{1}{N}, \forall \bar{x} \in \bar{\mathcal{X}}$. By applying an augmentation method $\mathcal{A}$, a natural sample $\bar{x} \in \bar{\mathcal{X}}$ could be augmented to another sample $x$ with probability $p_{\mathcal{A}}(x \mid \bar{x})$, so we use $p(\cdot \mid \bar{x})$ to encode the augmentation distribution. For example, if $\bar{x}$ is an image, then $\mathcal{A}$ can be common augmentations like Gaussian blur, color distortion and random cropping (Chen et al., 2020a). Denote the set of all possible augmented data as $\mathcal{X}$. We assume $\mathcal{X}$ has finite size $|\mathcal{X}| = L$ and $L > N$ for ease of exposition. Note that $N$ and $L$ are finite, but can be arbitrarily large. We denote the encoder as $f_\theta$, parameterized by $\theta$, which projects a sample $x$ to an embedding vector in $\mathbb{R}^k$.

4. LEARNING VIA AUGMENTATION OVERLAPS

As we mentioned in Section 1, measuring the similarity between the augmentation distributions, i.e., the overlap of the augmented results of two samples, reveals their semantic relationship well. For example, in natural language processing, we usually generate augmented sentences by dropping out some words. Then different sentences with similar meanings are likely to contain the same set of words and thus have a high probability of creating similar augmented data. With the help of this self-supervision, we formulate the embedding learning task to meet the following similarity preserving condition:

$$d_{\mathbb{R}^k}\left(f_{\theta^\star}(\bar{x}_1), f_{\theta^\star}(\bar{x}_2)\right) \propto d_{\mathcal{A}}\left(p(\cdot \mid \bar{x}_1), p(\cdot \mid \bar{x}_2)\right). \tag{1}$$

Here $d_{\mathbb{R}^k}$ is a distance measure in the embedding space $\mathbb{R}^k$, and $d_{\mathcal{A}}$ measures the distance between two augmentation distributions. Equation (1) requires that the embedding learned with the optimal parameter $\theta^\star$ yields the same similarity comparisons as those measured by the augmentation distributions. In this section, we first introduce the augmentation feature for each sample, which is a manually designed embedding satisfying the condition in Equation (1). To handle the high dimensionality and complexity of the augmentation feature, we further propose our Augmentation Component Analysis (ACA), which learns to reduce the dimensionality and preserve the similarity. Via ACA, our model can learn embeddings that preserve augmentation similarity for natural data.

4.1. AUGMENTATION FEATURE

To reach the goal of similarity preserving in Equation (1), a direct way is to manually construct the feature from the augmentation distribution of each natural sample, i.e., $f(\bar{x}) = [p(x_1 \mid \bar{x}), \ldots, p(x_L \mid \bar{x})]^\top$, where each element $p(x_i \mid \bar{x})$ represents the probability of getting a certain element $x_i$ in the space $\mathcal{X}$ by augmenting $\bar{x}$. We omit $\theta$ in $f(\bar{x})$ since such an augmentation feature does not rely on any learnable parameters. In this case, any distance $d_{\mathbb{R}^L}$ defined in the space of $f$ is exactly a valid distribution distance, which reveals the augmentation overlaps and is related to the semantic similarity.

Although the constructed augmentation feature naturally satisfies the similarity preserving condition (Equation (1)), because it directly uses the augmentation distribution without loss of information, it is impractical for the following reasons. First, its dimensionality is exponentially high: it is up to $L$, the number of possible augmented results. For example, even on CIFAR-10, a small-scale dataset with image size $32 \times 32 \times 3$, $L$ is up to $256^{3072}$ (3072 pixels and 256 possible pixel values). Second, the computation of each element is intractable. We may need an exponentially large number of samples to accurately estimate each $p(x \mid \bar{x})$. The dimensionality and computation problems make the augmentation feature impractical both at inference and training time. Such inconvenience motivates us to (1) conduct a certain dimension reduction to preserve the information in a low-dimensional space (Section 4.2) and (2) develop an efficient algorithm for dimension reduction (Section 4.3).
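To make the construction concrete, the following is a minimal sketch on a toy discrete view space, not the setup used in the paper. The triangular augmentation kernel, the helper names `augmentation_feature` and `overlap`, and the sample positions are all illustrative assumptions. It shows that two nearby samples share a large part of their augmentation mass, while a distant sample shares none.

```python
import numpy as np

def augmentation_feature(center, num_views, radius=2):
    """Toy row p(. | x_bar): a triangular kernel over discrete views (assumed kernel)."""
    views = np.arange(num_views)
    weights = np.maximum(0.0, radius + 1 - np.abs(views - center))
    return weights / weights.sum()

def overlap(p, q):
    """Probability mass shared by two augmentation distributions."""
    return float(np.minimum(p, q).sum())

L = 10                                   # number of possible augmented views
deer_a = augmentation_feature(2, L)      # two semantically close samples
deer_b = augmentation_feature(3, L)
car = augmentation_feature(8, L)         # a semantically distant sample

print("overlap(deer_a, deer_b) =", overlap(deer_a, deer_b))
print("overlap(deer_a, car)    =", overlap(deer_a, car))
print("dist(deer_a, deer_b)    =", np.linalg.norm(deer_a - deer_b))
print("dist(deer_a, car)       =", np.linalg.norm(deer_a - car))
```

In this toy case any distance on the feature vectors orders the pairs the same way as the overlap does, which is the behavior Equation (1) asks for.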

4.2. DIMENSION REDUCTION ON AUGMENTATION FEATURES

To deal with the high-dimensional property, we employ the idea of PCA (Hotelling, 1933), which reconstructs the data with principal components. For convenience, we denote the design matrix of the augmentation feature by $A$, where $A \in \mathbb{R}^{N \times L}$, $A_{\bar{x},x} = p(x \mid \bar{x})$ (see Figure 2). We perform PCA on a transformed augmentation feature called the normalized augmentation feature:

$$\hat{A} = A D^{-\frac{1}{2}}, \tag{2}$$

where $D = \mathrm{diag}([d_{x_1}, d_{x_2}, \ldots, d_{x_L}])$ and $d_x = \sum_{\bar{x}} p(x \mid \bar{x})$. Based on the normalized augmentation feature, we can develop an efficient algorithm for similarity preserving embeddings.

Assume the SVD of $\hat{A} = U \Sigma V^\top$ with $U \in \mathbb{R}^{N \times N}$, $\Sigma \in \mathbb{R}^{N \times L}$, $V \in \mathbb{R}^{L \times L}$. PCA first learns the projection matrix consisting of the top-$k$ right singular vectors, which can be denoted as $\tilde{V} \in \mathbb{R}^{L \times k}$. The vectors in $\tilde{V}$ are called Principal Components (PCs). Then, it projects the feature by $\hat{A}\tilde{V}$ to get the embeddings for each sample. The overall procedure is illustrated at the top-right of Figure 2. But performing PCA on the augmentation feature encounters many obstacles. The elements of the augmentation feature cannot be estimated accurately, not to mention its high dimensionality. Even if we can somehow get the projection matrix $\tilde{V}$, it is also impractical to project the high-dimensional matrix $\hat{A}$. For this reason, we propose ACA to make the PC learning and projection process efficient without explicitly calculating the elements of the augmentation feature.
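The PCA step described in this section can be written out directly when the augmentation matrix is small enough to store. The sketch below uses a random row-stochastic matrix as a stand-in for $A$ (an assumption made purely for illustration; real augmentation features are far too large for this) and follows Equation (2) and the SVD-based projection.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, k = 5, 12, 3                       # toy sizes with L > N

# Stand-in augmentation matrix: row x_bar holds p(. | x_bar) over the L views.
A = rng.random((N, L))
A /= A.sum(axis=1, keepdims=True)

d = A.sum(axis=0)                        # d_x = sum over x_bar of p(x | x_bar)
A_hat = A / np.sqrt(d)                   # normalized augmentation feature A D^{-1/2}

U, S, Vt = np.linalg.svd(A_hat, full_matrices=False)
V_k = Vt[:k].T                           # top-k right singular vectors (principal components)
embeddings = A_hat @ V_k                 # PCA projection, one k-dim row per natural sample

# The projection coincides with the top-k left singular vectors scaled by singular values.
print(np.allclose(embeddings, U[:, :k] * S[:k]))
```

Both the SVD and the final matrix product touch all $L$ columns, which is exactly the cost that ACA is designed to avoid.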

4.3. AUGMENTATION COMPONENT ANALYSIS

Although there are several obstacles when performing PCA on the augmentation features directly, fortunately, it is efficient to sample from the augmentation distribution $p(x \mid \bar{x})$, i.e., by performing augmentation on the natural data $\bar{x}$ to get an augmented sample $x$. Being aware of this, our ACA uses two practical losses to simulate the PCA process efficiently by sampling. The first contrastive-like loss leads the encoder to learn the principal components of $\hat{A}$, which can be efficiently optimized by sampling like traditional contrastive methods. The second loss performs on-the-fly projection of $\hat{A}$ along the training trajectory, which solves the difficulty of high-dimensional projection.

Learning principal components. ACA learns the principal components by an efficient contrastive-like loss. Besides their projection functionality, these learned principal components can also serve as embeddings that preserve a kind of posterior distribution similarity, as we will show later. In the SVD view, $U\Sigma$ serves as the PCA projection results for samples and $V$ contains the principal components (Jolliffe, 2002). However, if we change our view, $V\Sigma$ can be seen as the representation of each column. Since each column of $\hat{A}$ encodes the probability of the augmented data given natural data, $V\Sigma$ preserves certain augmentation relationships, as we will show in Theorem 4.2 later. To leverage the extrapolation power of encoders like deep neural networks, we choose to design a loss that can guide the parameterized encoder $f_\theta$ to learn embeddings similar to those of PCA. Inspired by the rank minimization view of PCA (Vidal et al., 2016), we employ the low-rank approximation objective with matrix factorization, similar to HaoChen et al. (2021):

$$\min_{F \in \mathbb{R}^{L \times k}} \mathcal{L}_{\text{mf}} = \left\| \hat{A}^\top \hat{A} - F F^\top \right\|_F^2, \tag{3}$$

where the columns of $F$ store the scaled version of the top-$k$ right singular vectors, and each row can be seen as the embedding of an augmented sample, as we will show in Lemma 4.1.
According to the Eckart-Young-Mirsky theorem (Eckart & Young, 1936), by optimizing $\mathcal{L}_{\text{mf}}$, we can get the optimal $F$, which has the form $\tilde{V}\tilde{\Sigma}Q$, where $Q \in \mathbb{R}^{k \times k}$ is an orthonormal matrix, and $\tilde{\Sigma}$ and $\tilde{V}$ contain the top-$k$ singular values and right singular vectors. By expanding Equation (3), we get the Augmentation Component Analysis Loss for learning Principal Components (ACA-PC) in the following lemma:

Lemma 4.1 (ACA-PC loss). Let $F_{x,:} = \sqrt{d_x}\, f_\theta^\top(x), \forall x \in \mathcal{X}$. Minimizing $\mathcal{L}_{\text{mf}}$ is equivalent to minimizing the following objective:

$$\mathcal{L}_{\text{ACA-PC}} = -2\, \mathbb{E}_{\substack{\bar{x} \sim p(\bar{x}),\; x_i \sim p(x_i \mid \bar{x}) \\ x_j \sim p(x_j \mid \bar{x})}} f_\theta(x_i)^\top f_\theta(x_j) + N\, \mathbb{E}_{x_1 \sim p_{\mathcal{A}}(x_1),\, x_2 \sim p_{\mathcal{A}}(x_2)} \left[ \left( f_\theta(x_1)^\top f_\theta(x_2) \right)^2 \right]. \tag{4}$$

The proof can be found in Appendix F. In ACA-PC, the first term is the common alignment loss for augmented data and the second term is a form of uniformity loss (Wang & Isola, 2020). Both terms can be estimated by Monte-Carlo sampling. ACA-PC is a kind of contrastive loss, but unlike most of the others, it has a theoretical interpretation. We note that the form of ACA-PC differs from the spectral loss (HaoChen et al., 2021) by adding a constant $N$ before the uniformity term. This term is similar to the noise strength in NCE (Gutmann & Hyvärinen, 2010) or the number of negative samples in InfoNCE (van den Oord et al., 2018). It can be proved that the embeddings learned by ACA-PC preserve the posterior distribution distances between augmented data:

Theorem 4.2 (Almost isometry for posterior distances). Assume $f_\theta$ is a universal encoder, $\sigma_{k+1}$ is the $(k+1)$-th largest singular value of $\hat{A}$, $d_{\min} = \min_x d_x$, and $\delta_{x_1 x_2} = \mathbb{I}(x_1 = x_2)$. The minimizer $\theta^\star$ of $\mathcal{L}_{\text{ACA-PC}}$ satisfies:

$$d^2_{\text{post}}(x_1, x_2) - \frac{2\sigma^2_{k+1}}{d_{\min}} (1 - \delta_{x_1 x_2}) \le \left\| f_{\theta^\star}(x_1) - f_{\theta^\star}(x_2) \right\|_2^2 \le d^2_{\text{post}}(x_1, x_2), \quad \forall x_1, x_2 \in \mathcal{X},$$

where the posterior distance

$$d^2_{\text{post}}(x_1, x_2) = \sum_{\bar{x} \in \bar{\mathcal{X}}} \left( p_{\mathcal{A}}(\bar{x} \mid x_1) - p_{\mathcal{A}}(\bar{x} \mid x_2) \right)^2 \tag{5}$$

measures the squared Euclidean distance between the posterior distributions $p_{\mathcal{A}}(\bar{x} \mid x) = \frac{p(x \mid \bar{x})\, p(\bar{x})}{p_{\mathcal{A}}(x)}$.

We give the proof in Appendix G. Theorem 4.2 states that the optimal encoder for ACA-PC preserves the distance of posterior distributions between augmented data within an error related to the embedding size $k$. As $k$ increases to $N$, the error decreases to 0. This corresponds to the phenomenon that a larger embedding size leads to better contrastive performance (Chen et al., 2020a). The posterior distribution $p_{\mathcal{A}}(\bar{x} \mid x)$ represents the probability that a given augmented sample $x$ is created by a natural sample $\bar{x}$. Augmented data that are only produced by the same natural sample will have the smallest distance, and embeddings of those in overlapped areas will be pulled together by ACA-PC. Since the overlapped areas are usually created by two same-class samples, ACA-PC can form a semantically meaningful embedding space.

It is also noticeable that the optimal encoder meets the similarity preserving condition (Equation (1)), but with respect to the posterior distribution for augmented data rather than the augmentation distribution for natural data. Since what we care about is the distribution of natural data, we further propose a projection loss that helps learn good embeddings for all the natural data.

On-the-fly Projection. As stated in the previous part, the embeddings learned by ACA-PC not only serve as certain embeddings for augmented data but also contain the principal components of the normalized augmentation feature. Based on this, we propose to use these embeddings as a projection operator to ensure meaningful embeddings for all the natural data. To be specific, denote the embedding matrix for all augmented data as $F^{\text{aug}} \in \mathbb{R}^{L \times k}$, where each row $F^{\text{aug}}_{x,:} = f^\top_{\theta^\star}(x)$. From Equation (3) and $F_{x,:} = \sqrt{d_x}\, f^\top_{\theta^\star}(x)$, it can be easily seen that:

$$F^{\text{aug}} = D^{-\frac{1}{2}} F = D^{-\frac{1}{2}} \tilde{V} \tilde{\Sigma} Q.$$

Similar to PCA (Hotelling, 1933), which projects the original feature by the principal components $V$, we propose to use $F^{\text{aug}}$ to project the augmentation feature to get the embeddings for each natural sample. Denote the embedding matrix for natural data as $F^{\text{nat}} \in \mathbb{R}^{N \times k}$, where each row $F^{\text{nat}}_{\bar{x},:}$ represents the embedding of $\bar{x}$. We compute $F^{\text{nat}}$ as follows:

$$F^{\text{nat}} = A F^{\text{aug}} = \hat{A} D^{\frac{1}{2}} D^{-\frac{1}{2}} \tilde{V} \tilde{\Sigma} Q = (\tilde{U} \tilde{\Sigma}) \tilde{\Sigma} Q, \tag{6}$$

where $\tilde{\Sigma}$ and $\tilde{U}$ contain the top-$k$ singular values and the corresponding left singular vectors. It is noticeable that $F^{\text{nat}}$ is exactly the PCA projection result multiplied by an additional matrix $\tilde{\Sigma} Q$. Fortunately, such an additional linear transformation does not affect the linear probe performance (HaoChen et al., 2021). With Equation (6), the embedding of each natural sample can be computed as follows:

$$F^{\text{nat}}_{\bar{x},:} = A_{\bar{x},:} F^{\text{aug}} = \sum_{x} p(x \mid \bar{x})\, f^\top_{\theta^\star}(x) = \mathbb{E}_{x \sim p(x \mid \bar{x})} f^\top_{\theta^\star}(x), \tag{7}$$

which is exactly the expected feature over the augmentation distribution. Similar to Theorem 4.2, the embeddings calculated by Equation (7) also present a certain isometry property:

Theorem 4.3 (Almost isometry for weighted augmentation distances). Assume $f_\theta$ is a universal encoder, $\sigma_{k+1}$ is the $(k+1)$-th largest singular value of $\hat{A}$, and $\delta_{\bar{x}_1 \bar{x}_2} = \mathbb{I}(\bar{x}_1 = \bar{x}_2)$. Let the minimizer of $\mathcal{L}_{\text{ACA-PC}}$ be $\theta^\star$ and $g(\bar{x}) = \mathbb{E}_{x \sim p(x \mid \bar{x})} f_{\theta^\star}(x)$ as in Equation (7). Then:

$$d^2_{\text{w-aug}}(\bar{x}_1, \bar{x}_2) - 2\sigma^2_{k+1} (1 - \delta_{\bar{x}_1 \bar{x}_2}) \le \left\| g(\bar{x}_1) - g(\bar{x}_2) \right\|^2_{\Sigma_k^{-2}} \le d^2_{\text{w-aug}}(\bar{x}_1, \bar{x}_2), \quad \forall \bar{x}_1, \bar{x}_2 \in \bar{\mathcal{X}},$$

where $\|\cdot\|_{\Sigma_k^{-2}}$ represents the Mahalanobis distance with matrix $\Sigma_k^{-2}$, $\Sigma_k = \mathrm{diag}([\sigma_1, \sigma_2, \ldots, \sigma_k])$ is the diagonal matrix containing the top-$k$ singular values, and the weighted augmentation distance

$$d^2_{\text{w-aug}}(\bar{x}_1, \bar{x}_2) = \frac{1}{N} \sum_{x \in \mathcal{X}} \frac{\left( p(x \mid \bar{x}_1) - p(x \mid \bar{x}_2) \right)^2}{p_{\mathcal{A}}(x)} \tag{8}$$

measures the weighted squared Euclidean distance between the augmentation distributions $p(x \mid \bar{x})$.

Different from Theorem 4.2, which presents isometry between Euclidean distances in embeddings and augmentation distribution, Theorem 4.3 presents isometry between Mahalanobis distances. The weighted augmentation distance weighs each squared difference of the Euclidean distance using $p_{\mathcal{A}}(x)$.
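The two isometry statements can be checked numerically on a small matrix where the exact minimizer is available from the SVD. The sketch below is an illustration under simplifying assumptions: a random row-stochastic matrix stands in for $A$, the rotation $Q$ is taken to be the identity, and the helper `pairwise_sq_dists` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, k = 6, 15, 3

# Toy augmentation matrix: row x_bar holds p(. | x_bar) over L views.
A = rng.random((N, L))
A /= A.sum(axis=1, keepdims=True)
d = A.sum(axis=0)                        # d_x = sum over x_bar of p(x | x_bar)
A_hat = A / np.sqrt(d)                   # normalized augmentation feature

U, S, Vt = np.linalg.svd(A_hat, full_matrices=False)
F = Vt[:k].T * S[:k]                     # optimal F with Q = I (assumption)
F_aug = F / np.sqrt(d)[:, None]          # rows play the role of f(x)
F_nat = A @ F_aug                        # rows play the role of g(x_bar) = E f(x)

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all pairs of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

tol = 1e-10
sigma_next_sq = S[k] ** 2                # squared (k+1)-th largest singular value

# Theorem 4.2: p_A(x_bar | x) = p(x | x_bar) p(x_bar) / p_A(x) = A[x_bar, x] / d_x.
posterior = (A / d).T                    # L x N, row x is p_A(. | x)
d_post = pairwise_sq_dists(posterior)
emb_aug = pairwise_sq_dists(F_aug)
slack_aug = 2 * sigma_next_sq / d.min() * (1 - np.eye(L))
thm42_holds = bool(np.all(d_post - slack_aug <= emb_aug + tol)
                   and np.all(emb_aug <= d_post + tol))

# Theorem 4.3: rows of A_hat give the weighted augmentation distance, since d_x = N p_A(x).
d_waug = pairwise_sq_dists(A_hat)
emb_nat = pairwise_sq_dists(F_nat / S[:k])   # Mahalanobis distance with Sigma_k^{-2}
slack_nat = 2 * sigma_next_sq * (1 - np.eye(N))
thm43_holds = bool(np.all(d_waug - slack_nat <= emb_nat + tol)
                   and np.all(emb_nat <= d_waug + tol))

print("Theorem 4.2 bounds hold:", thm42_holds)
print("Theorem 4.3 bounds hold:", thm43_holds)
```

Raising `k` toward `N` shrinks `S[k]` and therefore both slack terms, which mirrors the remark that the error vanishes as the embedding size grows.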
$d_{\text{w-aug}}$ can be regarded as a valid augmentation distance measure $d_{\mathcal{A}}$ as in Equation (1), and $F^{\text{nat}}$ preserves such a distance. So our goal is to make the embedding of $\bar{x}$ approach $\mathbb{E}_{p(x \mid \bar{x})} f_{\theta^\star}(x)$. However, as stated before, the additional projection process is not efficient, i.e., we need exponentially many samples from $p(x \mid \bar{x})$. We notice that samples drawn during the training process of ACA-PC can be reused. For this reason, we propose an on-the-fly projection loss that directly uses the current encoder for projection:

$$\mathcal{L}_{\text{proj}} = \mathbb{E}_{\bar{x} \sim p(\bar{x})} \left\| f_\theta(\bar{x}) - \mathbb{E}_{p(x \mid \bar{x})} f_\theta(x) \right\|_2^2. \tag{9}$$

Full objective of ACA. Based on the discussion above, ACA simultaneously learns the principal components by ACA-PC and projects natural data by an on-the-fly projection loss. The full objective of ACA has the following form:

$$\mathcal{L}_{\text{ACA-Full}} = \mathcal{L}_{\text{ACA-PC}} + \alpha \mathcal{L}_{\text{proj}}, \tag{10}$$

where $\alpha$ is a trade-off hyperparameter. We also find $N$ in Equation (4) too large for stable training, so we replace it with a tunable hyperparameter $K$. Here, we only display the loss in expectation forms. The details of the implementation are described in Appendix A.
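On a small discrete space the expectation forms of the losses can be evaluated exactly, which gives a sanity check of Lemma 4.1. The sketch below is an illustration only: the encoder is replaced by lookup tables (`F_emb` for augmented views and `G_nat` for natural samples, both hypothetical names), and the function names are introduced here. Expanding the Frobenius norm in Equation (3) suggests that $\mathcal{L}_{\text{mf}}$ equals $\|\hat{A}^\top \hat{A}\|_F^2 + N \mathcal{L}_{\text{ACA-PC}}$, which the code verifies on random inputs.

```python
import numpy as np

def aca_pc_exact(A, F_emb, neg_weight=None):
    """Exact ACA-PC loss on a discrete toy space.

    A:     N x L matrix whose row x_bar is p(. | x_bar).
    F_emb: L x k table whose row x stands in for f(x).
    neg_weight: weight of the uniformity term (N by default, K in practice).
    """
    N = A.shape[0]
    weight = N if neg_weight is None else neg_weight
    p_aug = A.mean(axis=0)                           # p_A(x) under uniform p(x_bar)
    mean_emb = A @ F_emb                             # E_{x ~ p(.|x_bar)} f(x)
    # x_i and x_j are independent given x_bar, so E f(x_i)^T f(x_j) = ||E f(x)||^2.
    align = np.mean(np.sum(mean_emb ** 2, axis=1))
    gram = F_emb @ F_emb.T
    uniform = p_aug @ (gram ** 2) @ p_aug            # E (f(x_1)^T f(x_2))^2
    return -2.0 * align + weight * uniform

def proj_exact(A, F_emb, G_nat):
    """Exact projection loss; G_nat rows stand in for f(x_bar)."""
    return np.mean(np.sum((G_nat - A @ F_emb) ** 2, axis=1))

def aca_full_exact(A, F_emb, G_nat, alpha=1.0, neg_weight=None):
    return aca_pc_exact(A, F_emb, neg_weight) + alpha * proj_exact(A, F_emb, G_nat)

rng = np.random.default_rng(1)
N, L, k = 4, 9, 3
A = rng.random((N, L))
A /= A.sum(axis=1, keepdims=True)
F_emb = rng.standard_normal((L, k))

d = A.sum(axis=0)
A_hat = A / np.sqrt(d)
F = np.sqrt(d)[:, None] * F_emb                      # F_{x,:} = sqrt(d_x) f(x)^T
l_mf = np.linalg.norm(A_hat.T @ A_hat - F @ F.T) ** 2
const = np.linalg.norm(A_hat.T @ A_hat) ** 2         # independent of the encoder
print(np.isclose(l_mf, const + N * aca_pc_exact(A, F_emb)))
```

Setting `G_nat = A @ F_emb` makes `proj_exact` vanish, which corresponds to the natural-data embedding being the expected feature in Equation (7).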

5. A PILOT STUDY

In this section, we experiment with our Augmentation Component Analysis method on synthetic mixture-component data with a Gaussian augmentation method. In this example, we aim to show the relationship between semantic similarity and posterior/weighted augmentation distances. We also show the effectiveness of our method compared to traditional contrastive learning. The natural data $\bar{x}$ are sampled from a Gaussian mixture with $c$ components:

$$p(\bar{x}) = \sum_{i=1}^{c} \pi_i\, \mathcal{N}(\mu_i, s_i I).$$

We use Gaussian noise as the data augmentation of a natural data sample, i.e., $\mathcal{A}(\bar{x}) = \bar{x} + \xi$ where $\xi \sim \mathcal{N}(0, s_a I)$. Concretely, we conduct our experiment on 2-D data with $c = 4$, $\pi_i = \frac{1}{c}$, $s_i = 1$ and $\mu_i$ uniformly distributed on a circle with radius 2. For each component, we sample 200 natural data with the index of the component as their label. For each natural datum, we augment it 2 times with $s_a = 4$, which results in 1600 augmented data in total. We compute the augmentation probability between $x$ and $\bar{x}$ by $p(x \mid \bar{x})$, and we normalize the probability for each $\bar{x}$.

First, we plot the distribution of posterior distances (Equation (5)) for pairs of augmented data and weighted augmentation distances (Equation (8)) for pairs of natural data in Figure 3 (left). The two distances appear to have similar distributions because the synthetic data are Gaussian. It can be seen that data from the same component tend to have small distances, while data from different components have large distances. In low-distance areas, there are pairs of the same class, which means that the two distances are reliable metrics for judging semantic similarity. In all, this picture reveals the correlation between semantic similarity and posterior/weighted augmentation distances.

Second, we compare our methods with SimCLR (Chen et al., 2020a), the traditional contrastive method, and Spectral (HaoChen et al., 2021), which similarly learns embeddings with spectral theory.
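The synthetic setup can be sketched in a few lines. The following is a rough reconstruction under stated assumptions, not the authors' script: $s_i$ and $s_a$ are treated as variances, the component means are placed at equal angles on the circle, the random seed is arbitrary, and the augmentation probabilities are normalized over the sampled views for each natural sample.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n_per_comp, n_views = 4, 200, 2
s_nat, s_aug = 1.0, 4.0                  # treated as variances (assumption)

# Component means placed at equal angles on a circle of radius 2 (assumption).
angles = 2 * np.pi * np.arange(c) / c
means = 2.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

labels = np.repeat(np.arange(c), n_per_comp)
natural = means[labels] + np.sqrt(s_nat) * rng.standard_normal((c * n_per_comp, 2))
augmented = (np.repeat(natural, n_views, axis=0)
             + np.sqrt(s_aug) * rng.standard_normal((c * n_per_comp * n_views, 2)))

# p(x | x_bar) is proportional to a Gaussian density, normalized per natural sample.
sq = ((natural[:, None, :] - augmented[None, :, :]) ** 2).sum(axis=-1)
A = np.exp(-sq / (2 * s_aug))
A /= A.sum(axis=1, keepdims=True)

# Weighted augmentation distance (Equation (8)) via rows of the normalized feature.
d = A.sum(axis=0)
A_hat = A / np.sqrt(d)
gram = A_hat @ A_hat.T
sq_norms = np.diag(gram)
d_waug = sq_norms[:, None] + sq_norms[None, :] - 2 * gram

same_class = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
mean_same = d_waug[same_class & off_diag].mean()
mean_diff = d_waug[~same_class].mean()
print(f"same-class mean: {mean_same:.4f}, different-class mean: {mean_diff:.4f}")
```

Comparing the two printed means gives a numerical counterpart to the histogram comparison described for Figure 3 (left).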
We test the learned embeddings using a logistic regression classifier and report the error rate of the prediction in Figure 3 (right). We also report the performance when directly using the augmentation feature (AF). First, AF is discriminative enough for simple linear classifiers. SimCLR and Spectral tend to underperform AF as the embedding size increases, while our methods consistently outperform it. This may seem confusing, since our method performs dimension reduction on this feature. But we note that as the embedding size increases, the complexity of the linear model also increases, which affects its generalizability. All the methods in Figure 3 (right) show degradation of this kind. However, our methods consistently outperform the others, which shows the superiority of ACA. Additionally, by adding the projection loss, ACA-Full improves on ACA-PC by a margin. Traditional contrastive learning like SimCLR achieves performance similar to our methods, which we believe indicates that traditional contrastive learning has the same functionality as our methods.

6. EXPERIMENTS

6.1. SETUP

Dataset. In this paper, we conduct experiments mainly on the following datasets with RTX-3090 ×4. CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009): two datasets, each containing 50K training images of size 32 × 32, from 10 and 100 classes respectively. STL-10 (Coates et al., 2011): derived from ImageNet (Deng et al., 2009), with 96 × 96 resolution images and 5K labeled training data from 10 classes. Additionally, 100K unlabeled images are used for unsupervised learning. Tiny

Optimizer and other Hyper-parameters. For all datasets except ImageNet, we use the Adam optimizer (Kingma & Ba, 2015). For CIFAR-10 and CIFAR-100, we train for 800 epochs with a learning rate of $3 \times 10^{-3}$. For Tiny ImageNet and STL-10, we train for 1,000 epochs with a learning rate of $2 \times 10^{-3}$. We decay the learning rate by a factor of 0.1 at 100, 50, and 20 epochs before the end of training. Due to hardware resource restrictions, we use a mini-batch of size 512. The weight decay is $1 \times 10^{-6}$ if not specified. Following common practice in contrastive learning, we normalize the projected feature onto a sphere. For CIFAR-10, we use $\alpha = 1$. For the remaining datasets, we use $\alpha = 0.2$. By default, $K$ is set to 2. For ImageNet, we use the same hyperparameters as Chen et al. (2020a), except that the batch size is 256, $\alpha = 0.2$ and $K = 2$.

Evaluation Protocol. We evaluate the learned representation on the two most commonly used protocols: linear classification (Zhang et al., 2016; Kolesnikov et al., 2019) and the k-nearest neighbors classifier (Chen & He, 2021). In all the experiments, we train the linear classifier for 100 epochs. The learning rate exponentially decays from $10^{-2}$ to $10^{-6}$. The weight decay is $1 \times 10^{-6}$. We report the classification accuracy on test embeddings as well as the accuracy of a 5-Nearest Neighbors classifier for datasets except for ImageNet.

6.2. PERFORMANCE COMPARISON

In Table 1, we compare the linear probe performance on various small-scale or mid-scale benchmarks with several methods, including SimCLR (Chen et al., 2020a) and BYOL (Grill et al., 2020). From Table 1, we can see that our ACA-Full method achieves competitive results on small- or mid-scale benchmarks, achieving either the best or the second-best results on all benchmarks except the 5-NN evaluation on STL-10. Also, ACA-PC differs from ACA-Full only in the projection loss. In all the benchmarks, we can see that the projection loss improves performance.

For large-scale benchmarks, we compare several methods on ImageNet-100 and ImageNet. On ImageNet-100, we additionally compare our method to MoCo (He et al., 2020), $\mathcal{L}_{\text{align}} + \mathcal{L}_{\text{uniform}}$ (Wang & Isola, 2020) and InfoMin (Tian et al., 2020b). Note that the results of these three methods are reported with the ResNet-50 encoder, which has more capacity than ResNet-18. Our method can also achieve state-of-the-art results among them. This means that our method is also effective with relatively small encoders even on large-scale datasets. On ImageNet, we see that ACA-PC achieves competitive performance against state-of-the-art contrastive methods (Chen et al., 2020a; c; Grill et al., 2020; Chen & He, 2021; HaoChen et al., 2021) and ACA-Full achieves the best.

7. CONCLUSION AND FUTURE WORK

In this paper, we provide a new way of constructing self-supervised contrastive learning tasks by modeling similarity through augmentation overlap, which is motivated by the observation that semantically similar data usually create similar augmentations. We propose Augmentation Component Analysis to perform PCA on the augmentation feature efficiently. Interestingly, our methods have a similar form to the traditional contrastive loss and may explain the ability of contrastive loss. We hope our paper can inspire more thoughts about how to measure similarity in self-supervised learning and how to construct contrastive learning tasks. Future studies may explore applying ACA to learn representations of other forms of instances, such as tasks (Achille et al., 2019) and models (Wu et al., 2023).

A IMPLEMENTATION AND FURTHER DISCUSSION

In the previous section, we have presented the expected form of ACA. Thanks to its form, we can efficiently optimize ACA by Monte-Carlo sampling, making the problem tractable. For convenience of illustration, we decompose the ACA-PC loss into two parts, i.e., $\mathcal{L}_{\text{ACA-PC}} = \mathcal{L}_{\text{ali}} + \mathcal{L}_{\text{uni}}$. For the first part, $\mathcal{L}_{\text{ali}}$ serves as the alignment loss in traditional contrastive learning, which maximizes the inner product similarity between augmented samples from the same natural data:

$$\mathcal{L}_{\text{ali}} = 2\, \mathbb{E}_{\substack{\bar{x} \sim p(\bar{x}),\; x_i \sim p(x_i \mid \bar{x}) \\ x_j \sim p(x_j \mid \bar{x})}} \left[ -f_\theta(x_i)^\top f_\theta(x_j) \right]. \tag{11}$$

We use the mini-batch of natural samples to estimate $\mathbb{E}_{\bar{x} \sim p(\bar{x})}$, and we use just one sample to estimate each of $\mathbb{E}_{x_i \sim p(x_i \mid \bar{x})}$ and $\mathbb{E}_{x_j \sim p(x_j \mid \bar{x})}$. This leads to the traditional contrastive learning procedure: sample a mini-batch of natural data, augment it twice, then compute and maximize the similarity of the two augmented data. For the second part, $\mathcal{L}_{\text{uni}}$ minimizes the inner product similarity of augmented data from the marginal distribution:

$$\mathcal{L}_{\text{uni}} = N\, \mathbb{E}_{x_1 \sim p_{\mathcal{A}}(x_1),\, x_2 \sim p_{\mathcal{A}}(x_2)} \left[ \left( f_\theta(x_1)^\top f_\theta(x_2) \right)^2 \right]. \tag{12}$$

We use the in-batch augmented data to estimate $\mathbb{E}_{x_1 \sim p_{\mathcal{A}}(x_1)}$.
Notably, two randomly sampled augmented samples are unlikely to be augmented from the same natural sample. Therefore, following common practice (Chen et al., 2020a), we use two augmented samples created from two different natural samples to compute this term. Additionally, we find that $N$ in $\mathcal{L}_{\text{uni}}$ is too large for stable numerical computation. Thus, in our implementation, we replace $N$ with a tunable noise strength $K$.

For $\mathcal{L}_{\text{proj}}$, it is not efficient to fully sample from $p(x \mid \bar{x})$. However, it is notable that:

$$\mathcal{L}_{\text{proj}} = \mathbb{E}_{\bar{x}\sim p(\bar{x})}\left\| f_\theta(\bar{x}) - \mathbb{E}_{p(x\mid\bar{x})} f_\theta(x) \right\|_2^2 = \mathbb{E}_{\bar{x}\sim p(\bar{x}),\; x_i\sim p(x_i\mid\bar{x}),\; x_j\sim p(x_j\mid\bar{x})}\left\| f_\theta(\bar{x}) - \frac{f_\theta(x_i) + f_\theta(x_j)}{2} \right\|_2^2.$$

It has the same expectation subscript as $\mathcal{L}_{\text{ali}}$, so we can use the same strategy as for $\mathcal{L}_{\text{ACA-PC}}$ and reuse the samples. $\mathcal{L}_{\text{proj}}$ is computed along with $\mathcal{L}_{\text{ali}}$ during training, i.e., the principal component learning and the projection are done simultaneously. That is why we call $\mathcal{L}_{\text{proj}}$ "on-the-fly projection". The overall implementation of ACA is illustrated in Figure 4, and the algorithm is given in Algorithm 1.

Discussion on the relation with traditional contrastive learning. As described in this section, ACA-PC takes a form similar to traditional contrastive learning methods. Like them, ACA-PC maximizes the inner product similarity of two views of the same sample by Equation (11) and minimizes the squared inner product similarity of views from different samples. Note that we have proved that the embeddings learned by ACA-PC function as the principal components of the augmentation feature and preserve the posterior augmentation distances (Theorem 4.2). We believe the traditional contrastive loss has a similar functionality. Given the strong correlation between augmentation overlap and semantic similarity, this may explain why contrastive learning can learn semantically meaningful embeddings even though it ignores the semantic relationship between samples.

Discussion on approximation in implementations.

There are several approximations to stabilize the training. First, we replace the factor $N$ in Equation (12) with a tunable noise strength $K$. The number of samples is usually very large in common datasets, and when we use a complex model such as a DNN, it is unstable to involve such a large number in the loss. Therefore, we tune it to a small value and find that this works well. We note, however, that we do use $N$ in our synthetic experiment in Section 5, because the dimensionality and the model there are not too complex. The superior results there support the effectiveness of our theory. Second, we normalize embeddings to project them onto a sphere, which is equivalent to replacing the inner product with cosine similarity. We find this modification improves the performance from 81.84% to 91.58%.

    2: for $i = 1, 2, \dots, B$ do
    3:     $x_i^{(1)} = A(\bar{x}_i)$, $x_i^{(2)} = A(\bar{x}_i)$
    4: end for
    5: $\mathcal{L}_{\text{ACA-Full}} = -\frac{2}{B}\sum_{i=1}^{B} f_\theta^\top(x_i^{(1)}) f_\theta(x_i^{(2)}) + \frac{K}{B(B-1)}\sum_{i\neq j}\left(f_\theta^\top(x_i^{(1)}) f_\theta(x_j^{(2)})\right)^2 + \alpha\left\| f_\theta(\bar{x}) - \frac{f_\theta(x_i^{(1)}) + f_\theta(x_i^{(2)})}{2} \right\|^2$
    6: update $\theta$ w.r.t. $\mathcal{L}_{\text{ACA-Full}}$
    7: end for
    8: return $f_\theta$
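The mini-batch estimate of the full loss can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `aca_full_loss` are hypothetical, the embeddings are assumed to be precomputed arrays, the projection term is assumed to be averaged over the batch, and the natural-sample embeddings are assumed to be normalized in the same way as the views.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Project each row onto the unit sphere (cosine-similarity variant)."""
    return z / np.maximum(np.linalg.norm(z, axis=1, keepdims=True), eps)

def aca_full_loss(z_nat, z1, z2, K=2.0, alpha=1.0, normalize=True):
    """Mini-batch estimate of the ACA-Full loss.

    z_nat, z1, z2: (B, d) embeddings of the natural batch and of its two
    augmented views. K replaces the factor N; alpha weights the projection term.
    """
    if normalize:
        z_nat, z1, z2 = (l2_normalize(z) for z in (z_nat, z1, z2))
    B = z1.shape[0]
    sim = z1 @ z2.T                                # sim[i, j] = f(x_i^(1))^T f(x_j^(2))
    align = -2.0 * np.trace(sim) / B               # views of the same natural sample
    off_diag = ~np.eye(B, dtype=bool)              # views of different natural samples
    uniform = K * np.sum(sim[off_diag] ** 2) / (B * (B - 1))
    # Assumption: the projection term is averaged over the batch.
    proj = alpha * np.mean(np.sum((z_nat - 0.5 * (z1 + z2)) ** 2, axis=1))
    return align + uniform + proj
```

In a training loop, the three inputs would come from one encoder applied to the natural batch and to two independent augmentations of it, so the projection term reuses the views already computed for the alignment term.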

B EFFECT OF AUGMENTATION OVERLAPS

Like contrastive learning, our method relies on the quality of augmentation. Therefore, we investigate the influence of different augmentations and reveal the relationship between distribution difference and linear probe performance on CIFAR10. The augmentation distribution is estimated by augmenting $10^6$ times for a subset of 2000 random pairs of samples, with 1000 intra-class pairs and 1000 inter-class pairs. Note that, as stated in Section 4.1, even on CIFAR10 the actual value of $L$ is exponentially large (up to $256^{3072}$). It is impossible to accurately estimate a distribution over so many possible values. However, many operators in neural networks, such as convolutions and poolings, reduce the number of possible values. Following this observation, and to make the computation efficient, we discretize the color into 8-bit for each channel and use a max pooling operation to get a $4 \times 4$ picture. With this approximation, $L$ reduces to $8^{48}$. This still seems too large, but the augmentation distribution of each sample covers only a small region, so it is enough to estimate the distribution by sampling. Due to memory restrictions, we cannot fully estimate the weighted augmentation distance in Theorem 4.3, because we cannot store all possible values of $p_A(x)$. Instead, we use the Hellinger distance as the distribution distance measure:

$$d_H^2(\bar{x}_1, \bar{x}_2) = \frac{1}{N}\sum_{x\in\mathcal{X}}\left(\sqrt{p(x\mid\bar{x}_1)} - \sqrt{p(x\mid\bar{x}_2)}\right)^2$$

The Hellinger distance ranges in [0, 2], making the comparison clear. We list the augmentations used in the experiments here:

1. Grayscale: Randomly change the color into gray with probability 0.1.
2. HorizontalFlip: Randomly flip horizontally with probability 0.5.
3. Rotation: Randomly rotate the image by a uniformly distributed angle in $[0, \pi]$.
4. ColorJitter: Jitter (brightness, contrast, saturation, hue) with strength (0.4, 0.4, 0.4, 0.1) and probability 0.8.
5. ResizedCrop: Extract crops with a random size from 0.2 to 1.0 of the original area and a random aspect ratio from 3/4 to 4/3 of the original aspect ratio.
6. Augs in SimCLR: Sequential combination of 5, 4, 1, 2.

Table 3: Histogram (HIST) of distribution distances and linear probe accuracy (ACC) when using different augmentations on CIFAR10. Note that HIST is estimated in the input space. It is a property of the augmentation, regardless of the learning algorithm. We aim to investigate the different augmentation overlaps caused by different augmentations and reveal their connection to the learned model. "Same" denotes the distance between samples of the same semantic class, and "Different" denotes samples of different classes. The existence of overlap and the relationship between intra-/inter-class distances affect the performance.
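The sampling-based estimate described above (reduce each augmented view to a coarse $4 \times 4$ code, count how often each code occurs, and compare two empirical distributions) can be sketched as follows. This is an illustrative sketch under assumptions: the helper names, the `levels=8` quantization, and the toy `random_hflip` augmentation are hypothetical choices, and the function returns the plain sum $\sum_x (\sqrt{p} - \sqrt{q})^2$, which lies in $[0, 2]$, leaving out any constant prefactor.

```python
from collections import Counter
import numpy as np

def discretize(img, levels=8, out_size=4):
    """Max-pool an (H, W, C) uint8 image to out_size x out_size and quantize each channel."""
    h, w, c = img.shape
    ph, pw = h // out_size, w // out_size
    blocks = img[:ph * out_size, :pw * out_size].reshape(out_size, ph, out_size, pw, c)
    pooled = blocks.max(axis=(1, 3))                       # (out_size, out_size, C)
    quantized = pooled.astype(np.int64) * levels // 256    # values in {0, ..., levels - 1}
    return tuple(quantized.ravel().tolist())               # hashable code of the view

def empirical_distribution(img, augment, n_samples, rng):
    """Estimate p(x | natural image) over discretized views by repeated augmentation."""
    counts = Counter(discretize(augment(img, rng)) for _ in range(n_samples))
    return {code: n / n_samples for code, n in counts.items()}

def hellinger_sq(p, q):
    """Sum over the union of supports of (sqrt(p) - sqrt(q))^2; ranges in [0, 2]."""
    codes = set(p) | set(q)
    return sum((np.sqrt(p.get(k, 0.0)) - np.sqrt(q.get(k, 0.0))) ** 2 for k in codes)

def random_hflip(img, rng):
    """Toy augmentation: horizontal flip with probability 0.5."""
    return img[:, ::-1] if rng.random() < 0.5 else img
```

Because only the codes that actually occur are stored, memory grows with the number of distinct sampled views rather than with the full $8^{48}$ space, which is the reason sampling remains feasible here.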


In Table 3, we display the histogram (HIST) of intra- and inter-class augmentation distribution distances. ACC displays the linear probe performance on the test set. From the table, the following requirements for a good augmentation can be concluded: (1) Existence of overlap. For the upper three augmentations, the "scope" of augmentation is small. As a result, most of the samples do not overlap, which leaves the embeddings without discriminative ability for downstream tasks. In contrast, the lower three create overlaps for most of the samples, leading to much better performance. (2) Intra-class distance is lower than inter-class distance. Compared to ColorJitter, ResizedCrop gives more intra-class pairs a lower distance, so ResizedCrop outperforms ColorJitter. The SimCLR augmentation surpasses these two for the same reason. Interestingly, we find that the same phenomena appear when using other contrastive methods such as SimCLR. This suggests that these methods somehow utilize the augmentation overlap as our method does.

C PERFORMANCE CURVE

In this section, we illustrate the performance curve throughout training. We aim to demonstrate the functionality of the projection loss and show that our ACA method leads to better performance. We choose SimCLR as the traditional contrastive learning baseline because our method differs from SimCLR only in the loss, with everything else (architecture, optimizer, and other shared hyperparameters) identical. Also, we do not introduce extra mechanisms such as a momentum encoder (BYOL, MoCo) or a predictor (BYOL, SimSiam).

Although our method is trained with fewer epochs, it achieves competitive results with contrastive learning methods. Notably, it surpasses the 1000-epoch SimCLR, which differs from our method only in the loss. This shows that the embeddings learned by our method are also transferable to other downstream tasks. We think this is due to the universality of the correlation between augmentation similarity and semantic similarity across these benchmarks.

(2015). We use the code provided in the MoCo repository (https://github.com/facebookresearch/moco) with default parameters. All the weights are finetuned on the trainval07+12 set and evaluated on the test07 set. We report an average over 5 runs in Table 5. Despite the shorter training, our method achieves better results than SimCLR, with a particularly large margin on AP$_{75}$ (> 1%).

F PROOF OF LEMMA 4.1

For convenience, we define $M := \hat{A}^\top \hat{A}$. The elements of $M$ are:

$$M_{x_1 x_2} = \sum_{\bar{x}\in\bar{\mathcal{X}}} \frac{p(x_1\mid\bar{x})\,p(x_2\mid\bar{x})}{\sqrt{d_{x_1}}\sqrt{d_{x_2}}}, \qquad x_1, x_2 \in \mathcal{X}.$$

Expanding Equation (3), we get:

$$\begin{aligned}
\mathcal{L}_{\text{mf}} &= \sum_{x_1,x_2\in\mathcal{X}} \left(M_{x_1x_2} - F_{x_1}^\top F_{x_2}\right)^2 \\
&= \sum_{x_1,x_2\in\mathcal{X}} \left(M_{x_1x_2} - \sqrt{d_{x_1}}\sqrt{d_{x_2}}\, f_\theta(x_1)^\top f_\theta(x_2)\right)^2 \\
&= \text{const} - 2\sum_{x_1,x_2\in\mathcal{X}} \sqrt{d_{x_1}}\sqrt{d_{x_2}}\, M_{x_1x_2}\, f_\theta(x_1)^\top f_\theta(x_2) + \sum_{x_1,x_2\in\mathcal{X}} d_{x_1} d_{x_2} \left(f_\theta(x_1)^\top f_\theta(x_2)\right)^2 \\
&= \text{const} - 2\sum_{x_1,x_2\in\mathcal{X}} \sum_{\bar{x}\in\bar{\mathcal{X}}} p(x_1\mid\bar{x})\,p(x_2\mid\bar{x})\, f_\theta(x_1)^\top f_\theta(x_2) + \sum_{x_1,x_2\in\mathcal{X}} d_{x_1} d_{x_2} \left(f_\theta(x_1)^\top f_\theta(x_2)\right)^2
\end{aligned}$$

We then multiply by $p(\bar{x}) = \frac{1}{N}$ and replace $d_x$ with $\sum_{\bar{x}} p(x\mid\bar{x}) = N p_A(x)$.
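The objective becomes:

$$\begin{aligned}
&\min_\theta\; -2\sum_{x_1,x_2\in\mathcal{X}} \sum_{\bar{x}\in\bar{\mathcal{X}}} p(x_1\mid\bar{x})\,p(x_2\mid\bar{x})\,p(\bar{x})\, f_\theta(x_1)^\top f_\theta(x_2) + N\sum_{x_1,x_2\in\mathcal{X}} p_A(x_1)\,p_A(x_2)\left(f_\theta(x_1)^\top f_\theta(x_2)\right)^2 \\
&= -2\,\mathbb{E}_{\bar{x}\sim p(\bar{x}),\; x_i\sim A(x_i\mid\bar{x}),\; x_j\sim A(x_j\mid\bar{x})}\left[f_\theta(x_i)^\top f_\theta(x_j)\right] + N\,\mathbb{E}_{x_1\sim p_A(x_1),\; x_2\sim p_A(x_2)}\left[\left(f_\theta(x_1)^\top f_\theta(x_2)\right)^2\right] \\
&= \mathcal{L}_{\text{ACA-PC}}
\end{aligned}$$

G PROOF OF THEOREM 4.2

As in Appendix F, we define $M := \hat{A}^\top \hat{A}$. By the Eckart-Young-Mirsky theorem (Eckart & Young, 1936), the minimizer $F$ of $\|M - FF^\top\|_F^2$ must have the form $V\Sigma Q$, where $\Sigma$ and $V$ contain the top-$k$ singular values and the corresponding right singular vectors of $\hat{A}$, and $Q \in \mathbb{R}^{k\times k}$ is some orthonormal matrix with $Q^\top Q = I$. Since we let $F_x = \sqrt{d_x}\, f_\theta(x)$, the minimizer $\theta^\star$ must satisfy

$$f_{\theta^\star}(x) = Q\,\frac{\sigma \odot v(x)}{\sqrt{d_x}} = Q\,\frac{[\sigma_1 v_1(x), \sigma_2 v_2(x), \dots, \sigma_k v_k(x)]^\top}{\sqrt{d_x}},$$

where $\odot$ is the element-wise multiplication. For convenience, we use $\sigma_i$ to denote the $i$-th largest singular value, and $u_i(\bar{x})$, $v_i(x)$ to denote the elements of the $i$-th left/right singular vector corresponding to $\bar{x}$ / $x$. When $p(\bar{x}) = \frac{1}{N}$, $d_x = N p_A(x) = \frac{p_A(x)}{p(\bar{x})}$. Then the posterior distance is:

$$\begin{aligned}
d^2_{\text{post}}(x_1, x_2) &= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(p_A(\bar{x}\mid x_1) - p_A(\bar{x}\mid x_2)\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(\frac{p(x_1\mid\bar{x})\,p(\bar{x})}{p_A(x_1)} - \frac{p(x_2\mid\bar{x})\,p(\bar{x})}{p_A(x_2)}\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(\frac{p(x_1\mid\bar{x})}{d_{x_1}} - \frac{p(x_2\mid\bar{x})}{d_{x_2}}\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(\frac{\hat{A}_{\bar{x}x_1}}{\sqrt{d_{x_1}}} - \frac{\hat{A}_{\bar{x}x_2}}{\sqrt{d_{x_2}}}\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(\sum_{i=1}^{N} \frac{\sigma_i u_i(\bar{x}) v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{\sigma_i u_i(\bar{x}) v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \left(\sum_{i=1}^{N} \sigma_i u_i(\bar{x}) \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)\right)^2 \\
&= \sum_{\bar{x}\in\bar{\mathcal{X}}} \sum_{i,i'} \sigma_i u_i(\bar{x})\,\sigma_{i'} u_{i'}(\bar{x}) \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)\left(\frac{v_{i'}(x_1)}{\sqrt{d_{x_1}}} - \frac{v_{i'}(x_2)}{\sqrt{d_{x_2}}}\right) \\
&= \sum_{i,i'} \sigma_i \sigma_{i'} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)\left(\frac{v_{i'}(x_1)}{\sqrt{d_{x_1}}} - \frac{v_{i'}(x_2)}{\sqrt{d_{x_2}}}\right) \sum_{\bar{x}\in\bar{\mathcal{X}}} u_i(\bar{x})\,u_{i'}(\bar{x}) \\
&\overset{(1)}{=} \sum_{i,i'} \sigma_i \sigma_{i'} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)\left(\frac{v_{i'}(x_1)}{\sqrt{d_{x_1}}} - \frac{v_{i'}(x_2)}{\sqrt{d_{x_2}}}\right) \delta_{i,i'} \\
&= \sum_{i=1}^{N} \sigma_i^2 \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2
\end{aligned}$$

(1) is due to the orthogonality of singular vectors.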
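The identity just derived, and the truncation bounds used in the remainder of this proof, can be checked numerically on a small random augmentation matrix. The following is a toy sketch, not part of the original experiments: the sizes `N = 5` and `L = 7`, the random matrix `P`, and the helper names are all hypothetical, and $p(\bar{x}) = 1/N$ is assumed as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 5, 7                                    # toy numbers of natural samples / augmented values
P = rng.random((N, L))
P /= P.sum(axis=1, keepdims=True)              # P[n, l] = p(x_l | xbar_n), rows sum to 1

d = P.sum(axis=0)                              # d_x = sum over xbar of p(x | xbar)
A_hat = P / np.sqrt(d)                         # A_hat[xbar, x] = p(x | xbar) / sqrt(d_x)
U, S, Vt = np.linalg.svd(A_hat, full_matrices=False)   # Vt[i, l] = v_i(x_l)

post = P / d                                   # post[n, l] = p_A(xbar_n | x_l) when p(xbar) = 1/N

def d_post_sq(a, b):
    """Squared posterior distance between augmented values a and b."""
    return np.sum((post[:, a] - post[:, b]) ** 2)

def d_spectral_sq(a, b, k=None):
    """Sum over the top-k components of sigma_i^2 (v_i(a)/sqrt(d_a) - v_i(b)/sqrt(d_b))^2."""
    diff = Vt[:k, a] / np.sqrt(d[a]) - Vt[:k, b] / np.sqrt(d[b])
    return np.sum(S[:k] ** 2 * diff ** 2)

# Using all components reproduces the posterior distance exactly.
full_gap = abs(d_post_sq(0, 1) - d_spectral_sq(0, 1))

# Keeping only k components (the embedding distance) loses at most 2 * sigma_{k+1}^2 / d_min.
k = 2
loss_from_truncation = d_post_sq(0, 1) - d_spectral_sq(0, 1, k)
bound = 2.0 * S[k] ** 2 / d.min()
```

With these definitions, `full_gap` is zero up to floating-point error and `loss_from_truncation` lies between 0 and `bound`, which matches the two-sided statement of the theorem for distinct augmented values.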
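Note that:

$$\begin{aligned}
\sum_{i=1}^{N} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 &= \sum_{i=1}^{L} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 - \sum_{i=N+1}^{L} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&\le \sum_{i=1}^{L} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&= \sum_{i=1}^{L} \frac{v_i^2(x_1)}{d_{x_1}} + \sum_{i=1}^{L} \frac{v_i^2(x_2)}{d_{x_2}} - 2\sum_{i=1}^{L} \frac{v_i(x_1)\,v_i(x_2)}{\sqrt{d_{x_1} d_{x_2}}} \\
&= \frac{1}{d_{x_1}} + \frac{1}{d_{x_2}} - \frac{2\delta_{x_1x_2}}{\sqrt{d_{x_1} d_{x_2}}} \\
&\overset{(2)}{\le} \left(\frac{1}{d_{x_1}} + \frac{1}{d_{x_2}}\right)(1 - \delta_{x_1x_2}) \\
&\le \frac{2}{d_{\min}}(1 - \delta_{x_1x_2})
\end{aligned}$$

(2) can be deduced by considering the two cases $x_1 = x_2$ and $x_1 \neq x_2$. Then:

$$\begin{aligned}
\|f_{\theta^\star}(x_1) - f_{\theta^\star}(x_2)\|^2 &= \sum_{i=1}^{k} \sigma_i^2 \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&= d^2_{\text{post}}(x_1, x_2) - \sum_{i=k+1}^{N} \sigma_i^2 \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \quad (\le d^2_{\text{post}}(x_1, x_2)) \\
&\ge d^2_{\text{post}}(x_1, x_2) - \sigma_{k+1}^2 \sum_{i=k+1}^{N} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&\ge d^2_{\text{post}}(x_1, x_2) - \sigma_{k+1}^2 \sum_{i=1}^{N} \left(\frac{v_i(x_1)}{\sqrt{d_{x_1}}} - \frac{v_i(x_2)}{\sqrt{d_{x_2}}}\right)^2 \\
&\ge d^2_{\text{post}}(x_1, x_2) - \frac{2\sigma_{k+1}^2}{d_{\min}}(1 - \delta_{x_1x_2})
\end{aligned}$$

$$\begin{aligned}
\cdots &= \sum_{x\in\mathcal{X}} \left(\sum_{i=1}^{N} \sigma_i u_i(\bar{x}_1) v_i(x) - \sigma_i u_i(\bar{x}_2) v_i(x)\right)^2 \\
&= \sum_{x\in\mathcal{X}} \left(\sum_{i=1}^{N} \sigma_i \left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right) v_i(x)\right)^2 \\
&= \sum_{x\in\mathcal{X}} \sum_{i,i'} \sigma_i v_i(x)\,\sigma_{i'} v_{i'}(x) \left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right)\left(u_{i'}(\bar{x}_1) - u_{i'}(\bar{x}_2)\right) \\
&= \sum_{i,i'} \sigma_i \sigma_{i'} \left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right)\left(u_{i'}(\bar{x}_1) - u_{i'}(\bar{x}_2)\right) \sum_{x\in\mathcal{X}} v_i(x)\,v_{i'}(x) \\
&\overset{(1)}{=} \sum_{i,i'} \sigma_i \sigma_{i'} \left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right)\left(u_{i'}(\bar{x}_1) - u_{i'}(\bar{x}_2)\right) \delta_{i,i'} \\
&= \sum_{i=1}^{N} \sigma_i^2 \left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right)^2
\end{aligned}$$

(1) is due to the orthogonality of singular vectors. And $g(\bar{x})$ takes the following form:

$$g(\bar{x}) = Q\,[\sigma_1^2 u_1(\bar{x}), \ldots]$$

We conduct ablation experiments on the parameters $\alpha$ and $K$. $\alpha$ is the trade-off parameter between the ACA-PC loss and the projection loss (Equation (10)). $K$ acts as the noise strength for ACA-PC and replaces $N$ in Equation (4). Figure 6 shows the effect of $\alpha$ and $K$ on different benchmarks. It can be seen that $\alpha$ is necessary to improve the performance of ACA-PC: a suitable value of $\alpha$ helps the model achieve better results, whereas too large a value degrades the performance. The same phenomenon holds for $K$.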

J COMPARISON OF NEAREST NEIGHBORS

We randomly select 8 samples from the validation set of ImageNet-100 (Tian et al., 2020a). Then we use the encoders learned by our ACA method and by SimCLR (Chen et al., 2020a) to extract features and inspect their nearest neighbors. The left-most column displays the selected samples and the following columns show their 5 nearest neighbors. Samples labeled with a different class are marked by a red box. We also annotate the distance between each sample and its nearest neighbors. First, we can see that even though it utilizes augmentation in a different way, ACA achieves results similar to traditional contrastive learning: both learn semantically meaningful embeddings. However, ACA tends to learn embeddings that pull together images that are similar in the input space, i.e., that create similar augmentations, while SimCLR sometimes has neighbors that seem different.
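The retrieval step behind this comparison can be sketched as below. It is a minimal illustration under assumptions: the function name `nearest_neighbors` is hypothetical, embeddings are assumed to be precomputed NumPy arrays, and Euclidean distance between L2-normalized embeddings is assumed as the metric, since the text does not state which distance is annotated.

```python
import numpy as np

def nearest_neighbors(queries, gallery, k=5):
    """Return indices and distances of the k nearest gallery embeddings for each query.

    queries: (Q, d) array, gallery: (G, d) array. Rows are L2-normalized first,
    so the squared Euclidean distance equals 2 - 2 * cosine similarity.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    d2 = np.maximum(2.0 - 2.0 * (q @ g.T), 0.0)    # clip tiny negatives from rounding
    idx = np.argsort(d2, axis=1)[:, :k]            # nearest first
    dist = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    return idx, dist
```

Running the same function on the embeddings of two encoders for the same query images gives the two rows of neighbors that such a figure compares side by side.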



Note that $p(\cdot \mid \bar{x})$ is usually difficult to compute and we can only sample from it. We omit the subscript $A$ and directly use $p(\cdot \mid \bar{x})$ in the following content for convenience.

Following the common knowledge in dimension reduction, we call the raw high-dimensional representation the "feature" and the learned low-dimensional representation the "embedding".

In this paper, we use the non-centred version (Reyment & Jöreskog, 1996), which is more appropriate for observations than for variables, where the origin matters more.

https://github.com/facebookresearch/moco



Figure 1: Left: semantically similar samples (e.g., those in the same class) usually create similar augmentations. The right figure indicates the same class images have higher averaged augmentation overlaps than those from different classes on four common datasets. For this reason, we learn embeddings by preserving the similarity between augmentation distributions of samples.

Figure 2: The idea of learning embeddings via Augmentation Component Analysis (ACA). The upper right figure demonstrates the process of PCA. It learns PCs and projects the input feature to get embeddings of data. Similarly, ACA performs PCA on the augmentation feature, which encodes all the information about the augmentation distribution. To overcome the dimensional and computational complexity, ACA employs ACA-PC loss and projection loss to learn PCs and embeddings tractably. Via ACA, our model can learn embeddings that preserve augmentation similarity for natural data.

Figure 3: Synthetic experiments on mixture Gaussian data with Gaussian noise as augmentation. (a) The posterior distance and weighted augmentation distances among data sampled from the same component and different components. It reveals the correlation between semantic similarity and the two distances, especially when the distance is small. (b) Comparison of linear classification performance among SimCLR, Spectral and our methods with various embedding dimensions ranging from 4 to 200. The dashed line represents the result when directly using Augmentation Feature (AF). ACA-PC outperforms SimCLR and Spectral. ACA-Full further improves.

Tiny ImageNet: a reduced version of ImageNet (Deng et al., 2009), composed of 100K images scaled down to 64 × 64 from 200 classes. ImageNet-100 (Tian et al., 2020a): a subset of ImageNet with 100 classes. ImageNet (Deng et al., 2009): the large-scale dataset with 1K classes.

Network Structure. Following common practice (Chen et al., 2020a;b;c), we use the encoder-projector structure during training, where the projector projects the embeddings into a low-dimensional space. For CIFAR-10 and CIFAR-100, we use the CIFAR variant of ResNet-18 (He et al., 2016; Chen & He, 2021) as the encoder. We use a two-layer MLP as the projector, whose hidden dimension is half of the input dimension and whose output dimension is 64. For STL-10 and Tiny ImageNet, only the max-pooling layer is disabled, following (Chen & He, 2021; Ermolov et al., 2021). For these two datasets, we use the same projector structure, except that the output dimension is 128. For ImageNet, we use ResNet-50 with the same projector as Chen et al. (2020a).

Image Transformation. Following the common practice of contrastive learning (Chen et al., 2020a), we apply the following augmentations sequentially during training: (a) crops with a random size; (b) random horizontal flipping; (c) color jittering; (d) grayscaling. For ImageNet-100 and ImageNet, we use the same implementation as (Chen et al., 2020a).
Optimizer and other Hyper-parameters. For all datasets except ImageNet, we use the Adam optimizer (Kingma & Ba, 2015). For CIFAR-10 and CIFAR-100, we train for 800 epochs with a learning rate of $3 \times 10^{-3}$. For Tiny ImageNet and STL-10, we train for 1,000 epochs with a learning rate of $2 \times 10^{-3}$. We use a 0.1 learning rate decay at 100, 50, and 20 epochs before the end. Due to hardware resource restrictions, we use a mini-batch of size 512. The weight decay is $1 \times 10^{-6}$ if not specified. Following common practice in contrastive learning, we normalize the projected feature onto a sphere. For CIFAR-10, we use $\alpha = 1$. For the remaining datasets, we use $\alpha = 0.2$. By default, $K$ is set to 2. For ImageNet, we use the same hyperparameters as (Chen et al., 2020a), except that the batch size is 256, $\alpha = 0.2$ and $K = 2$.

Evaluation Protocol. We evaluate the learned representation with the two most commonly used protocols: linear classification (Zhang et al., 2016; Kolesnikov et al., 2019) and a k-nearest neighbors classifier (Chen & He, 2021). In all the experiments, we train the linear classifier for 100 epochs. The learning rate decays exponentially from $10^{-2}$ to $10^{-6}$. The weight decay is $1 \times 10^{-6}$. We report the classification accuracy on test embeddings as well as the accuracy of a 5-nearest neighbors classifier for datasets except ImageNet.
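The two learning-rate schedules described above can be written as small pure functions. This is a sketch under assumptions the text does not settle: epochs are taken to be 0-indexed, each of the three milestones is assumed to multiply the rate by 0.1 cumulatively from that epoch onward, and the exponential schedule is assumed to reach its end value at the last epoch. The function names are hypothetical.

```python
def pretrain_lr(epoch, total_epochs, base_lr,
                milestones_before_end=(100, 50, 20), gamma=0.1):
    """Step schedule: multiply by gamma at each milestone counted back from the end."""
    n_decays = sum(epoch >= total_epochs - m for m in milestones_before_end)
    return base_lr * gamma ** n_decays

def linear_probe_lr(epoch, total_epochs=100, lr_start=1e-2, lr_end=1e-6):
    """Exponential interpolation from lr_start at epoch 0 to lr_end at the last epoch."""
    t = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** t
```

For the 800-epoch CIFAR setting with base rate $3 \times 10^{-3}$, this reading gives decays at epochs 700, 750 and 780.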

Figure 4: The implementation of ACA. Like traditional contrastive learning methods, ACA samples a mini-batch from the whole natural dataset, then performs two random augmentations on the mini-batch. The mini-batch sampling and augmentation create samples from $p(\bar{x})$ and $p(x \mid \bar{x})$ respectively. Then we use the samples to estimate the values of $\mathcal{L}_{\text{ali}}$, $\mathcal{L}_{\text{uni}}$ and $\mathcal{L}_{\text{proj}}$ as in the figure.

Figure 5 shows the performance curve along with the projection loss on the CIFAR-10 dataset. The left figure shows the projection loss. We can see that in the early stage of training, the projection loss

$$\cdots\,\left(u_i(\bar{x}_1) - u_i(\bar{x}_2)\right)^2 \;\cdots\; (\le d^2_{\text{w-aug}}(\bar{x}_1, \bar{x}_2)) \;\cdots\; = d^2_{\text{w-aug}}(\bar{x}_1, \bar{x}_2) - 2\sigma_{k+1}^2 (1 - \delta_{\bar{x}_1\bar{x}_2})$$

I ABLATION STUDY ON PARAMETER α AND K

Figure 6: Ablation studies on the effect of α and K. We report linear classification and 5-nearest neighbor accuracy on different datasets with the ResNet-18 encoder. The upper 3 figures illustrate the effect of α on 3 different datasets. The lower 3 figures illustrate the performance vs. K.

Figure 7: The 5 nearest neighbors of selected samples in the embedding space of ACA and SimCLR. The images were taken from the ImageNet-100 validation set. Distances between selected samples and their nearest neighbors are annotated above each picture. We can see that the embeddings learned by ACA tend to pull together images that are similar in the input space, i.e., that create similar augmentations, while SimCLR sometimes has neighbors that seem different.


Top-1 linear classification accuracy and 5-NN accuracy on four datasets with a ResNet-18 encoder. We use bold to mark the best results and underline to mark the second-best results. ♯ means the results are reproduced by our code.

Left: Top-1 classification accuracy and 5-NN accuracy on ImageNet-100 with ResNet-18. † : results are taken from Wang & Isola (2020). ⋆ : results are taken from Tian et al. (2020b). ♯ means the results are reproduced by our code. Right: Top-1 classification accuracy on ImageNet with ResNet-50; results are taken from (Chen & He, 2021; HaoChen et al., 2021). We use bold to mark the best results and underline to mark the second-best results.

Algorithm 1 Augmentation Component Analysis Algorithm
Require: Unlabeled natural dataset $\{\bar{x}_i\}_{i=1}^{N}$; augmentation function $A$; encoding model $f_\theta$ parameterized by $\theta$; projection parameter $\alpha$; noise strength $K$; batch size $B$.
    1: for sampled minibatch $\{\bar{x}_i\}_{i=1}^{B}$ do

We compare various SSL methods on transfer tasks by training linear layers. Only a single linear layer is trained on top of the features. Simple random horizontal flips are used. Results except ours are taken from Koohpayegani et al. (2021). Our method achieves competitive results with other contrastive learning methods despite fewer training epochs; in particular, it surpasses the 1000-epoch SimCLR.

It reveals that the natural data will deviate from the center of the augmentation distribution, which is harmful to the performance of the model. With the help of the projection loss, the embeddings of natural data are dragged back to their right position, the center. The middle and right figures illustrate the performance curve during training. With only the ACA-PC loss, the model can only achieve similar performance during training, but the ACA-Full loss helps improve performance during training. Also, we can see that ACA starts to outperform SimCLR and ACA-PC by a considerable margin from about 50 epochs. This happens to be the epoch at which the projection loss increases to its stable level. Therefore, pulling the natural data to the center of its augmentation distribution helps to learn better embeddings.

All the results other than ACA are taken from Koohpayegani et al. (2021).

We compare our models on the transfer task of object detection. We find that, given a similar computational budget, our method is better than SimCLR with a shorter training time. The models are trained on the VOC trainval07+12 set and evaluated on the test07 set.

ACKNOWLEDGEMENTS

This research was supported by NSFC (61773198, 62006112, 61921006), Collaborative Innovation Center of Novel Software Technology and Industrialization, and NSF of Jiangsu Province (BK20200313).

availability

Code available at https://github.com/hanlu-nju

