ADVERSARIAL PERTURBATION BASED LATENT RECONSTRUCTION FOR DOMAIN-AGNOSTIC SELF-SUPERVISED LEARNING

Abstract

Most self-supervised learning (SSL) methods rely on domain-specific pretext tasks and data augmentations to learn high-quality representations from unlabeled data. Developing those pretext tasks and augmentations requires expert domain knowledge, and it remains unclear why solving certain pretext tasks leads to useful representations. These two limitations hinder the wider application of SSL to new domains. To overcome them, we propose adversarial perturbation based latent reconstruction (APLR) for domain-agnostic self-supervised learning. In APLR, a neural network is trained to generate adversarial noise that perturbs the unlabeled training sample, so that domain-specific augmentations are not required. The pretext task in APLR is to reconstruct the latent representation of a clean sample from a perturbed sample. We show that representation learning via latent reconstruction is closely related to the multi-dimensional Hirschfeld-Gebelein-Rényi (HGR) maximal correlation and has theoretical guarantees on the linear probe error. To demonstrate its effectiveness, the proposed method is applied to diverse domains, including tabular, image, and audio data. Empirical results indicate that APLR not only outperforms existing domain-agnostic SSL methods, but also narrows the performance gap to domain-specific SSL methods. In many cases, APLR even outperforms training the full network in a supervised manner.

1. INTRODUCTION

Unsupervised deep learning has been highly successful in discovering useful representations in natural language processing (NLP) (Devlin et al., 2019; Brown et al., 2020) and computer vision (Chen et al., 2020; He et al., 2020). These methods define pretext tasks on unlabeled data so that representation learning can proceed in a self-supervised manner without explicit human annotations. The success of self-supervised learning (SSL) depends on domain-specific pretext tasks as well as domain-specific data augmentations. However, developing semantic-preserving data augmentations requires expert domain knowledge, and such knowledge may not be readily available for certain data types such as tabular data (Ucar et al., 2021). Furthermore, the theoretical understanding of why certain pretext tasks lead to useful representations remains fairly elusive (Tian et al., 2021). These two limitations hinder the wider application of SSL beyond NLP and computer vision. Self-supervised algorithms benefit from the inductive biases of domain-specific designs, but they do not generalize across domains. For example, masked language models like BERT (Devlin et al., 2019) are not directly applicable to untokenized data. Although contrastive learning does not require tokenized data, its success in computer vision cannot be easily transferred to other domains due to its sensitivity to image-specific data augmentations (Chen et al., 2020). Furthermore, in contrastive learning, the quality of representations degrades significantly without those hand-crafted data augmentations (Grill et al., 2020). Inspired by denoising auto-encoding (Vincent et al., 2008; 2010; Pathak et al., 2016), perturbing natural samples with Gaussian, Bernoulli, and mixup noise (Verma et al., 2021; Yoon et al., 2020) has been used as a domain-agnostic data augmentation for self-supervised representation learning on images, graphs, and tabular data.
However, random noise may not be as effective, since uniformly perturbing uninformative features does not serve the intended goal of augmentation. In particular, the convex combinations used in mixup noise (Zhang et al., 2018; Yun et al., 2019) can generate out-of-distribution samples because there is no guarantee that the input data space is convex (Ucar et al., 2021). In this article, we use generative adversarial perturbation as a semantic-preserving data augmentation method (Baluja & Fischer, 2018; Poursaeed et al., 2018; Naseer et al., 2019; Nakka & Salzmann, 2021) applicable to different data domains. Adversarial perturbation is constrained by the ℓ_p norm distance to the natural sample, so that it is semantic-preserving and does not change the label (Goodfellow et al., 2015; Madry et al., 2018). With semantic-preserving perturbation, the pretext task in domain-agnostic SSL could be reconstruction of clean samples (Yoon et al., 2020) or instance discrimination of perturbed samples (Verma et al., 2021). Nevertheless, reconstructing clean samples in the input space is computationally expensive because the input dimension is high. Therefore, we present adversarial perturbation based latent reconstruction (APLR), a simple and intuitive domain-agnostic self-supervised pretext task derived from linear generative models. Contrary to the pretext task of instance discrimination, our method does not require comparison to a large number of negative samples to achieve good performance. APLR not only achieves strong empirical performance on SSL in various domains, but also has theoretical guarantees on the linear probe error on downstream tasks. The contributions of this article are summarized below:
• We present adversarial perturbation based latent reconstruction for domain-agnostic self-supervised learning.
• The proposed APLR achieves strong linear probe performance on various data types without using domain-specific data augmentations.
• We provide theoretical guarantees on the linear probe error on downstream tasks.

2. BACKGROUND

Learning representations from two views of an input, x_1 and x_2, is appealing if the learned representations do not contain the view-specific noise. This assumption can be explicitly encoded into the following generative model (Bach & Jordan, 2005) with one shared latent variable z:

p(z) = N(0, I),  p(x_1 | z) = N(ψ⊤z, Σ_1),  p(x_2 | z) = N(η⊤z, Σ_2),

where the model parameters ψ, η, Σ_1 and Σ_2 can be learned by maximum likelihood estimation. Reconstructing the input data via maximum likelihood estimation is computationally expensive when the input dimension is high. Instead, the probabilistic generative model can be reinterpreted as latent reconstruction, which has the benefit of direct representation learning while retaining the properties of generative modeling. To convert generative modeling into latent reconstruction, two assumptions need to be met. First, generative modeling assumes that both datasets have similar low-rank approximations; in latent reconstruction, this is achieved by correlating the pair of latent embeddings ψx_1 and ηx_2. Second, generative modeling assumes that the latent variables follow an isotropic Gaussian distribution p(z) = N(0, I); in latent reconstruction, the covariance matrix of the latent variables is diagonal, meaning that there is no covariance between different dimensions of the latent variable. This orthogonality constraint is equivalent to the isotropic Gaussian prior and avoids the trivial solution. As a result, by correlating the latent embeddings ψx_1 and ηx_2 and enforcing a diagonal covariance matrix, the properties of generative modeling are retained in latent reconstruction. The key principle behind latent reconstruction is that the latent representation of x_1 is a good predictor of that of x_2.
Given two datasets X_1 and X_2 of N observations, the projection directions are found by maximizing the regularized correlation between the latent scores of x_1 and x_2:

max_{ψ_i, η_i}  Cov(X_1 ψ_i, X_2 η_i)² / [ (γ + (1 - γ) Var(X_1 ψ_i)) (γ + (1 - γ) Var(X_2 η_i)) ],   (2)

where ψ_i and η_i are the i-th directions of the projection matrices and 0 ≤ γ ≤ 1 is a regularization coefficient (Rosipal & Krämer, 2005; Hardoon et al., 2004). When γ = 0, this is unregularized canonical correlation analysis (CCA) (Hotelling, 1936). When γ = 1, it corresponds to partial least squares (PLS) (Wold, 1975; Wold et al., 1984). Solving the optimization problem in Eq. (2) requires singular value decomposition or non-linear iterative methods (the NIPALS algorithm) to make the projection directions orthogonal to each other. The computational cost of both methods is prohibitively expensive when the data dimension is high or the number of samples is large. Therefore, it is more desirable to update ψ and η alternately in an iterative manner (Breiman & Friedman, 1985).
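The regularized correlation objective above can be checked numerically. The sketch below (illustrative names and data, not from the paper's code) evaluates the objective for a single pair of directions: with γ = 0 it reduces to the squared Pearson correlation of the latent scores, and with γ = 1 to the squared covariance used in PLS.

```python
import numpy as np

def regularized_correlation(X1, X2, psi, eta, gamma):
    """Regularized correlation objective for one pair of directions.

    gamma = 0 recovers (squared) canonical correlation (CCA);
    gamma = 1 recovers the PLS covariance criterion.
    """
    s1, s2 = X1 @ psi, X2 @ eta              # latent scores
    cov = np.cov(s1, s2)[0, 1]               # Cov(X1 psi, X2 eta)
    v1, v2 = s1.var(ddof=1), s2.var(ddof=1)
    return cov ** 2 / ((gamma + (1 - gamma) * v1) *
                       (gamma + (1 - gamma) * v2))

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 5))
X2 = X1 @ rng.normal(size=(5, 4)) + 0.1 * rng.normal(size=(100, 4))
psi, eta = rng.normal(size=5), rng.normal(size=4)

# gamma = 0: equals the squared Pearson correlation of the two scores
rho2 = regularized_correlation(X1, X2, psi, eta, gamma=0.0)
r = np.corrcoef(X1 @ psi, X2 @ eta)[0, 1]
```

This makes the two limiting cases of the regularizer concrete before the iterative updates are discussed.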

3. ADVERSARIAL PERTURBATION BASED LATENT RECONSTRUCTION

3.1. LATENT RECONSTRUCTION

Let x_1 be a sample perturbed with a certain type of noise, which is adversarial noise in our case. Our pretext task in SSL is to reconstruct the latent representation of the clean sample x_2 from the perturbed sample x_1. We use deep neural networks ψ(•) and η(•) to project x_1 and x_2 into latent spaces, respectively. The reconstruction in the latent space can be achieved by maximizing the inner product between ψ(x_1) and η(x_2) when ψ(x_1) and η(x_2) have zero mean and unit variance. Based on the discussion in Section 2, latent reconstruction must be done with orthogonality constraints to avoid the trivial solution in which ψ(•) and η(•) project all input data onto a constant vector. Latent reconstruction with orthogonality constraints is equivalent to finding the multi-dimensional Hirschfeld-Gebelein-Rényi (HGR) maximal correlation (Renyi, 1959; Makur et al., 2015) between two random views, defined as

ρ(x_1; x_2) ≜ sup_{E[ψ(x_1)] = E[η(x_2)] = 0, E[ψ(x_1)ψ(x_1)⊤] = E[η(x_2)η(x_2)⊤] = I}  E[ψ(x_1)⊤ η(x_2)],   (3)

where the zero-mean constraints can be easily satisfied using a batch normalization layer (Ioffe & Szegedy, 2015) and the identity-covariance constraints can be enforced by driving the off-diagonal elements to zero. In practice, the constrained optimization problem in Eq. (3) is solved by minimizing the following loss

L_LR = -E_{x_1, x_2 ∈ B}[ψ(x_1)⊤ η(x_2)] + (γ/2) ( ∥ψ(x_1)ψ(x_1)⊤ - I∥²_F + ∥η(x_2)η(x_2)⊤ - I∥²_F ),   (4)

where γ is a Lagrange multiplier, ∥•∥_F denotes the Frobenius norm, and B is a mini-batch.
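As a concrete sketch of this loss, the minimal NumPy version below assumes the (batch, k) embeddings have already been standardized to zero mean and unit variance per dimension (e.g. by batch normalization); all names are illustrative, not taken from the paper's code.

```python
import numpy as np

def latent_reconstruction_loss(Z1, Z2, gamma=1.0):
    """Sketch of the latent reconstruction loss. Z1 = psi(x1) and
    Z2 = eta(x2) are (batch, k) embeddings assumed standardized to
    zero mean and unit variance per dimension."""
    n, k = Z1.shape
    align = -np.mean(np.sum(Z1 * Z2, axis=1))   # -E[psi(x1)^T eta(x2)]
    I = np.eye(k)
    C1 = Z1.T @ Z1 / n                          # empirical covariance
    C2 = Z2.T @ Z2 / n
    decorrelate = np.sum((C1 - I) ** 2) + np.sum((C2 - I) ** 2)
    return align + 0.5 * gamma * decorrelate

# Perfectly whitened, identical views: the alignment term reaches -k and
# the decorrelation penalty vanishes, so the loss equals -k.
Z = 2.0 * np.eye(4)            # batch n = 4, k = 4, so Z.T @ Z / n = I
loss = latent_reconstruction_loss(Z, Z)
```

The decorrelation penalty is what rules out the constant-vector collapse discussed above: a collapsed batch has a rank-one covariance far from the identity.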

3.2. ADVERSARIAL PERTURBATION

Adversarial perturbation creates input samples that are almost indistinguishable from natural data but cause deep learning models to make wrong predictions (Szegedy et al., 2014). We use a generative model to produce adversarial perturbations because it can create diverse perturbations very quickly (Baluja & Fischer, 2018; Poursaeed et al., 2018; Naseer et al., 2019; Li et al., 2020). A generator G is trained to produce an unbounded adversarial perturbation δ = G(x_2). The perturbation is then clipped to be within an ϵ bound of x_2 under the ℓ_p norm. Let x_1 = x_2 + δ be the perturbed view of the clean sample x_2. The vast majority of adversarial perturbation methods rely on the classification boundary of the attacked neural network (here ψ(•) and η(•)) to train the generator by maximizing a cross-entropy loss. However, obtaining generative adversarial perturbations by maximizing a cross-entropy loss is not possible in our case because no label is available. In addition, existing generative adversarial perturbation methods that explicitly rely on the classification boundary of the attacked model tend to over-fit to the training data (Nakka & Salzmann, 2021). Instead of using a cross-entropy loss, we train G(•) by maximizing the ℓ_2 distance between ψ(x_1) and η(x_2):

L_adv = E_{x_1, x_2 ∈ B} ∥η(x_2) - ψ(x_1)∥²,   (5)

where ψ(•) and η(•) are frozen.
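A minimal sketch of the clipping step and the generator objective follows; the helper names are hypothetical (the paper's G is a neural network), and only the ϵ-ball projection and the loss are illustrated.

```python
import numpy as np

def clip_perturbation(delta, eps, p=np.inf):
    """Project the raw generator output onto the eps-ball in the l_p
    norm (hypothetical helper; G itself is a neural network)."""
    norm = np.linalg.norm(delta.ravel(), ord=p)
    if norm > eps:
        delta = eps * delta / norm
    return delta

def adversarial_loss(psi_x1, eta_x2):
    """Squared l2 distance between the two embeddings, averaged over
    the mini-batch; maximized w.r.t. the generator while the encoders
    psi and eta are frozen."""
    return np.mean(np.sum((eta_x2 - psi_x1) ** 2, axis=1))

x2 = np.zeros(3)                                          # clean sample
delta = clip_perturbation(np.array([3.0, -4.0, 0.0]), eps=1.0, p=2)
x1 = x2 + delta                          # perturbed view, ||x1 - x2||_2 = 1
```

Freezing the encoders while maximizing this distance is what makes the perturbation adversarial to the latent reconstruction task rather than to a classification boundary.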

3.3. ADVERSARIAL TRAINING

Our model is trained in an adversarial manner. Given a mini-batch of data, we train G(•) by maximizing L_adv while freezing ψ(•) and η(•). Then we update the parameters in ψ(•) and η(•) alternately by minimizing L_LR while freezing G(•). This process is illustrated in Fig. 1 and Algorithm 1.

Algorithm 1:
for each mini-batch sample x do
  Generate an unbounded adversarial perturbation δ = G(x)   ▷ δ has the same shape as x
  Clip the adversarial perturbation: δ = ϵδ/∥δ∥_p
  Obtain the perturbed sample x_1 = x + δ and the clean sample x_2 = x
  Obtain latent representations ψ(x_1) and η(x_2)
  Compute L_LR and update ψ(•) using SGD
  Update η(•) using the exponential moving average of the parameters in ψ(•)
end for

Figure 1: Overview of APLR. The generator G produces δ, which is added to x to form x_1; the encoders ψ and η embed x_1 and the clean x_2. L_adv updates G, L_LR updates ψ, and η is a momentum (EMA) copy of ψ.
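One step of this alternating procedure can be sketched in a few lines. Linear maps stand in for G, ψ and η, the gradient updates themselves are elided, and all shapes and hyperparameters are illustrative, not the paper's settings.

```python
import numpy as np

# One training step of the alternating scheme with linear stand-ins
# for G, psi, and eta; the gradient updates of G and psi are elided.
rng = np.random.default_rng(0)
d, k, eps, m = 8, 4, 0.1, 0.99
W_g = rng.normal(size=(d, d))        # "generator" G
W_psi = rng.normal(size=(d, k))      # online encoder psi
W_eta = W_psi.copy()                 # momentum encoder eta

x = rng.normal(size=d)
delta = W_g @ x                                # unbounded perturbation
delta = eps * delta / np.abs(delta).max()      # clip to the l_inf ball
x1, x2 = x + delta, x                          # perturbed and clean views
z1, z2 = x1 @ W_psi, x2 @ W_eta                # latent representations

# (maximize L_adv w.r.t. W_g, then minimize L_LR w.r.t. W_psi here)
W_eta = m * W_eta + (1 - m) * W_psi            # EMA update of eta
```

The EMA update keeps η a slowly moving target, so the generator attacks a stable pair of encoders within each step.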

4. THEORETICAL ANALYSIS

Let x be a data sample without perturbation and y(x) be its downstream task label. The quality of the representation ψ(x) is evaluated by the linear probe error, i.e., the error of predicting y(x) from ψ(x) using a linear model parameterized by B ∈ R^{k×r}. Let f_B(x) = arg max_{i∈[r]} (ψ(x)⊤B)_i be the prediction of the linear model. The linear probe error on ψ(x) is defined as Err_ψ := min_B Pr_{x∼P(x)}[y(x) ≠ f_B(x)], where P(x) is the data distribution. We make two assumptions to bound the linear probe error on the learned representations. First, we assume that some universal minimizer of Eq. (3) can be realized by ψ(•) and η(•). When the nonlinear mapping attaining the multi-dimensional HGR maximal correlation is realizable by neural networks, we can analyze the quality of the learned representations using the properties of estimating the HGR maximal correlation.

Assumption 4.1 (Realizability). Let H be a hypothesis class containing functions ψ : X_1 → R^k and η : X_2 → R^k. We assume that at least one global minimum of L(ψ, η) belongs to H.

In addition, it is also reasonable to assume that an optimal classifier f*(•) can predict the label of x almost deterministically under semantic-preserving perturbation. The assumption about the classification error of f*(•) provides a baseline for quantifying the linear probe error, because part of that error comes from approximating f*(•) with a linear model.

Assumption 4.2 (α-bounded Error of the Optimal Classifier). Let x be a data sample without perturbation, y(x) its downstream task label, and δ a semantic-preserving perturbation. We assume that there is a classifier f* such that Pr(f*(x) ≠ y(x)) ≤ α and Pr(f*(x + δ) ≠ y(x)) ≤ α.

Given Assumptions 4.1 and 4.2, we provide the following main theorem on the generalization bound when learning a linear classifier with finitely many labeled samples on the representations learned by maximizing the HGR maximal correlation.
Theorem 4.3. Let ψ*, η* ∈ H be a minimizer of L(ψ, η). The linear classification parameter B is estimated with n_2 i.i.d. random samples {(x_i, y_i)}_{i=1}^{n_2}. With probability at least 1 - ζ over the randomness of the data, we have

Pr_{x∼P(x)}[y(x) ≠ f_B(x)] ≤ O( α/(1 - λ_{k+1}) + √(k/n_2) ),

where λ_{k+1} is the (k+1)-th Hirschfeld-Gebelein-Rényi maximal correlation between X_1 and X_2. We hide universal constant factors and logarithmic factors of k in O(•). The first term on the right-hand side guarantees the existence of a linear classifier that achieves a small downstream classification error. It indicates whether the downstream label is linearly separable by the learned representation, thus measuring the expressivity of the learned representation. The second term reveals the sample complexity of learning B from finitely many labeled samples in the downstream task; it measures the data-efficiency of learning the downstream task with the learned representation. The proof is presented in the Appendix.

5. EXPERIMENTS

We demonstrate the effectiveness of the proposed method on three data domains: tabular data, images, and audio. Additionally, we include ablation studies and a sensitivity analysis in Appendix C. For all datasets, we follow the widely used linear evaluation protocol in self-supervised learning as a proxy for the quality of the learned representations (He et al., 2020; Chen & He, 2021). After the feature extractor is pretrained on unlabeled training data, we discard the projection head and learn a linear classifier on top of the frozen backbone network using the labeled training data. For tabular and audio experiments, we search the perturbation budget hyperparameter ϵ over the set {0.05, 0.1, 0.15}. For image experiments, we fix ϵ to 0.05 for a direct comparison with Viewmaker networks (Tamkin et al., 2021). We find that constraining the perturbations to an ℓ_1 norm distance achieves the best results. For all experiments, we train the feature extractor and the adversarial generator in an alternating fashion. The feature extractor ψ(•) is trained with the SGD optimizer with momentum 0.9 and weight decay 5e-4; the learning rate is 0.03 without decay. The momentum coefficient of the exponential moving average used to update η(•) is set to 0.99. The generator is trained with the Adam optimizer with an initial learning rate of 1e-3; its architecture is described in the Appendix. Both the feature extractor and the generator are trained for 200 epochs with a batch size of 256. After self-supervised training on unlabeled data, a linear classifier is trained using SGD with a batch size of 256 and no weight decay for 200 epochs. The learning rate starts at 30.0 and is decayed to 0 over 200 epochs with a cosine schedule.
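The linear evaluation protocol itself is simple to sketch: freeze the pretrained backbone, then fit only a linear classifier on its features. The toy example below uses a closed-form least-squares probe on synthetic features in place of the SGD-trained linear layer described above; the data, the downstream task, and all names are illustrative stand-ins.

```python
import numpy as np

# Linear-probe sketch: the backbone is frozen, so only a linear
# classifier is fit on its output features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))          # frozen-backbone features
labels = (feats[:, 0] > 0).astype(int)      # toy binary downstream task

Y = np.eye(2)[labels]                       # one-hot targets
B, *_ = np.linalg.lstsq(feats, Y, rcond=None)   # probe weights
preds = (feats @ B).argmax(axis=1)
accuracy = float((preds == labels).mean())
```

Because only B is trained, the probe's accuracy directly reflects how linearly separable the downstream labels are in the frozen feature space, which is exactly what Theorem 4.3 bounds.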

5.1. TABULAR DATA

For tabular data, we follow existing works (Yoon et al., 2020; Verma et al., 2021) and use MNIST and Fashion-MNIST as proxy datasets by flattening the images into 1-dimensional vectors. In addition, we use two real tabular datasets from the UCI repository to evaluate the proposed method (Dua & Graff, 2017). The results are summarized in Table 1. APLR outperforms VIME-Self (Yoon et al., 2020), which corrupts tabular data and uses mask vector estimation and feature vector estimation as pretext tasks, on three out of four datasets. This aligns with the empirical observation that reconstructing high-dimensional data in the input space is not necessary for learning high-quality representations. APLR outperforms the domain-agnostic benchmark DACL (Verma et al., 2021), which uses mixup noise, on all four datasets. Mixup noise is less effective than adversarial noise because it perturbs informative and uninformative dimensions of the input space uniformly. Furthermore, convex combinations in the input space via mixup may produce augmented views off the data manifold. Interestingly, APLR also outperforms training the full architecture in a supervised manner on the two real tabular datasets, Gas and Gesture.

5.2. IMAGE DATA

We use four benchmark image datasets to evaluate the effectiveness of the proposed method: CIFAR-10/100 (Krizhevsky, 2009), STL-10 (Coates et al., 2011), and Tiny-ImageNet (Le & Yang, 2015). The details of the datasets are described in the Appendix. ResNet18 (He et al., 2016) is adopted as the backbone network in the feature extractor, and we adopt the generator from Nakka & Salzmann (2021) for image data. We present the results for self-supervised representation learning on image data in Table 2. APLR outperforms DACL (Verma et al., 2021) by a large margin, indicating that adversarial noise is a more effective semantic-preserving perturbation than the mixup noise in DACL; interpolating input samples via mixup can produce out-of-distribution training samples because the input data space may not be convex. Our method also achieves better performance than Viewmaker (Tamkin et al., 2021), a domain-agnostic self-supervised learning method that discriminates adversarially perturbed data. The adversarial noise in APLR is more robust because its training process does not rely on the classification boundary between augmented samples (Nakka & Salzmann, 2021). Furthermore, we compare APLR against methods that use image augmentations (e.g., cropping, rotation, horizontal flipping), such as SimCLR (Chen et al., 2020). Previous studies (Grill et al., 2020; Chen et al., 2020) found that random cropping is a crucial data augmentation for learning high-quality image representations. However, cropped views cannot be created with adversarial perturbation because the adversarial noise is additive to the natural sample. Given the importance of random cropping and this limitation, achieving accuracies comparable to SimCLR indicates that adversarial noise is a highly effective data augmentation method.
5.3. AUDIO DATA

LibriSpeech-100 is a corpus of read English speech (Panayotov et al., 2015). We use speaker identification as the downstream classification task. We follow the experimental setup from

6. RELATED METHODS

Unsupervised representation learning can be categorized into two classes based on the type of pretext task: generative or discriminative. Generative approaches learn to generate or reconstruct unlabeled data in the input space (Higgins et al., 2017; Donahue et al., 2017; Donahue & Simonyan, 2019). Reconstructing the masked portion of data has been highly successful in discovering useful representations in natural language processing (Mikolov et al., 2013; Brown et al., 2020; Devlin et al., 2019). Before the success of masked language models, variants of denoising or masked autoencoders were developed for computer vision tasks (Vincent et al., 2008; 2010; Pathak et al., 2016), but their performance lagged behind discriminative SSL methods. Only recently were masked image models revived in unsupervised visual representation learning by discretizing image patches via tokenizers (Bao et al., 2022; Zhou et al., 2022). MAE (He et al., 2022) further simplifies masked image models by directly inpainting masked images without tokenizers or image-specific augmentations. Although masking is a simple data augmentation that can be flexibly applied to different data domains, computationally expensive generation or reconstruction in the input space may not be necessary for representation learning. Our method is derived from a generative model and shares with masked autoencoding the idea of reconstructing corrupted samples. Instead of reconstructing discrete tokens or raw inputs, our method reconstructs the continuous latent representation, which relates it to discriminative SSL methods based on data augmentations. Augmentation-based discriminative SSL methods learn representations by comparing (including but not limited to contrastive learning) augmented views of unlabeled data in the latent space.
This line of work involves a contrastive framework with variants of the InfoNCE loss (Gutmann & Hyvärinen, 2010; Oord et al., 2018) that pull together representations of augmented views of the same training sample (positive samples) and disperse representations of augmented views from different training samples (negative samples) (Wu et al., 2018; Hénaff et al., 2020; Wang & Isola, 2020). Typically, contrastive learning methods require a large number of negative samples to learn high-quality representations from unlabeled data (Chen et al., 2020; He et al., 2020). Meanwhile, non-contrastive methods train neural networks to match the representations of augmented positive pairs without comparison to negative samples or cluster centers. However, non-contrastive methods suffer from trivial solutions in which the model maps all inputs to the same constant vector, known as a collapsed representation. Various methods have been proposed to avoid collapsed representations on an ad hoc basis, such as asymmetric network architectures (Grill et al., 2020), stop-gradient (Chen & He, 2021), and feature decorrelation (Ermolov et al., 2021; Zbontar et al., 2021; Hua et al., 2021). Interestingly, our method also includes a feature decorrelation constraint, which is adapted from a generative model. Recently, adversarial perturbations have been combined with image augmentations to create more challenging positive and negative samples in self-supervised learning (Ho & Vasconcelos, 2020; Yang et al., 2022). APLR does not require domain-specific augmentations and can be applied to different data domains. Learning augmentations has also been investigated in supervised learning to obtain data-dependent augmentation policies for better generalization (Cubuk et al., 2019; Hataya et al., 2020). In parallel, adversarial perturbation can be treated as a special form of learnable augmentation that enhances model robustness through adversarial training (Goodfellow et al., 2015; Madry et al., 2018).
The domain-agnostic augmentations in our method are closely related to generative adversarial perturbation, where data augmentations are obtained through a forward pass of a learnable generative model (Li et al., 2020; Baluja & Fischer, 2018; Poursaeed et al., 2018; Naseer et al., 2019). The vast majority of adversarial perturbation methods rely on the classification boundary of the attacked neural network to train the generator by maximizing a cross-entropy loss. These ideas have been extended to SSL to obtain adversarial perturbations by maximizing the InfoNCE loss in SimCLR (Kim et al., 2020; Tamkin et al., 2021). However, existing generative adversarial perturbation methods rely explicitly on the classification boundary or the instance discrimination boundary of the attacked model and tend to over-fit to the source data (Nakka & Salzmann, 2021). Instead of maximizing a cross-entropy loss, we maximize the ℓ_2 distance between mid-level feature maps to obtain generative adversarial perturbations. Theoretical understanding of SSL has been studied under the assumption that augmented views of the same raw sample are somewhat conditionally independent given the label or a hidden variable (Arora et al., 2019; Tosh et al., 2021a;b; Lee et al., 2021). However, those assumptions do not hold in practice: augmented views of a natural sample are usually highly correlated and are unlikely to be independent given the hidden label. Recent studies in contrastive learning provide theoretical guarantees on the learned representation without the assumption of conditional independence (HaoChen et al., 2021; Wang et al., 2022). In parallel, Tian et al. (2021) investigate the training dynamics of non-contrastive SSL methods to show how feature collapse is avoided, but without guarantees for solving downstream tasks. Note that our proposed method does not involve an explicit comparison between positive and negative samples.
Our theoretical analysis relies on the divergence transition matrix without the assumption of conditional independence.

7. CONCLUSIONS

In this article, we introduce APLR, a domain-agnostic SSL method based on reconstructing adversarially perturbed samples in the latent space. The adversarial perturbation is created by a generative network, which is trained concurrently with the feature encoder in an adversarial manner. Our empirical results show that the proposed method outperforms existing domain-agnostic SSL methods and achieves performance comparable to SOTA domain-specific SSL methods. In many cases, APLR also outperforms training the same architecture in a fully supervised manner, demonstrating its strong ability to learn useful latent representations. In addition, the proposed latent reconstruction is linked to the Hirschfeld-Gebelein-Rényi maximal correlation and thus has theoretical guarantees on downstream classification tasks. We believe the proposed method can be extended beyond classification, for example to reinforcement learning.

A PROOF OF THE MAIN THEOREM

The HGR maximal correlation can be estimated from the divergence transition matrix A ∈ R^{|X_1|×|X_2|}, whose entries are defined by the joint and marginal distributions of x_1 and x_2 (Witsenhausen, 1975). Let P_{x_1} and P_{x_2} be the marginal distributions and P_{x_1 x_2} be the joint distribution. P_{x_1}(x_1^i) can be viewed as the probability mass of x_1^i being randomly sampled from X_1. Then, each entry of A is given by

A_{ij} = P_{x_1 x_2}(x_1^i, x_2^j) / ( √P_{x_1}(x_1^i) √P_{x_2}(x_2^j) ).

The optimal value of Eq. (3) is the sum of the top k singular values of A, with left singular vectors Z_1 ∈ R^{N×k} and right singular vectors Z_2 ∈ R^{N×k} defined row-wise as

z_1^i = √P_{x_1}(x_1^i) ψ(x_1^i),  z_2^i = √P_{x_2}(x_2^i) η(x_2^i),  i = 1, ..., N,

where z_1^i and z_2^i are the i-th rows of the embedding matrices Z_1 and Z_2, respectively (Huang & Xu, 2020). This is essentially a rank-k approximation of A via minimizing ∥A - Z_1 Z_2⊤∥²_F. Note that x_2 is a clean sample and z_2 is the representation of a clean sample; we use clean samples in downstream tasks and drop the superscript to avoid cluttered notation. The first term on the right-hand side of the main theorem (Theorem 4.3) measures the approximation error of the optimal classifier f* by a linear classifier parameterized by B. It amounts to the residual of the least squares problem ∥f* - ZB∥² in Fig. 2, where the representation matrix Z ∈ R^{N×k} contains the top-k left singular vectors of A and f* ∈ {0, 1}^N is the vector that contains the predicted labels of all the data under the optimal classifier f*. The approximation error is bounded if f* has limited projection onto the residual subspace that is perpendicular to the column space of the representation.

Figure 2: Geometric interpretation of least squares. f*(x) : X → {0, 1} is the Bayes optimal classifier for predicting the label given x, with error at most α according to Assumption 4.2.
The green panel is the subspace spanned by the columns of the representation Z, and B is the parameter of the linear classification model.

In the first step, we construct a quadratic form of f* to quantify its projection onto the residual space, based on the singular value decomposition of A. The largest singular value of A is 1, with constant left and right singular vectors 1 and 1 (Huang & Xu, 2020). Therefore, it is more convenient to subtract the top singular mode and introduce Ā = I - A. Ā can be factorized as Ā = Σ_{i=1}^N γ_i u_i v_i⊤ via singular value decomposition, where γ_i is the i-th singular value of Ā with left singular vector u_i and right singular vector v_i. The quadratic form is given as follows:

f*⊤ Ā f* = f*⊤ ( Σ_{i=1}^k γ_i u_i v_i⊤ + Σ_{i=k+1}^N γ_i u_i v_i⊤ ) f*   (10)
        = f*⊤ ( Σ_{i=1}^k γ_i u_i u_i⊤ + Σ_{i=k+1}^N γ_i u_i u_i⊤ ) f*   (11)
        ≥ f*⊤ ( Σ_{i=k+1}^N γ_i u_i u_i⊤ ) f*   (12)
        ≥ f*⊤ ( γ_{k+1} Σ_{i=k+1}^N u_i u_i⊤ ) f*   (13)
        = γ_{k+1} f*⊤ P f* = γ_{k+1} f*⊤ P⊤ P f* = γ_{k+1} ∥P f*∥²,   (14)

where Eq. (11) holds because the left and right singular vectors coincide for the symmetric matrix Ā, the inequality in Eq. (12) follows from dropping a positive semi-definite quadratic term, and Eq. (13) follows from γ_{k+1} ≤ γ_{k+2} ≤ ... ≤ γ_N. P ≜ Σ_{i=k+1}^N u_i u_i⊤ defines a projection matrix that projects f* onto the residual subspace spanned by the singular vectors u_{k+1}, ..., u_N. Eq. (14) holds because P is an idempotent matrix (P² = P) (Greene, 2003). In addition, (I - P)f* lies in the subspace spanned by the singular vectors u_1, ..., u_k, which is the column space of Z. Based on the geometric interpretation of the least squares problem ∥f* - ZB∥², there exists B such that (I - P)f* = ZB is the projection of f* onto the column space of Z. In the second step, we upper bound γ_{k+1}∥Pf*∥². Based on Eq. (14), we have γ_{k+1}∥Pf*∥² ≤ f*⊤ Ā f*.
It is thus sufficient to upper bound f*⊤ Ā f*:

f*⊤ Ā f* = f*⊤ I f* - f*⊤ A f*
= Σ_i f*_{x_i} f*_{x_i} - Σ_{i,j} P_{x_1 x_2}(x_i, x_j) (f*_{x_i}/√P_{x_1}(x_i)) (f*_{x_j}/√P_{x_2}(x_j))
= (1/2) ( Σ_i f*_{x_i}² - 2 Σ_{i,j} P_{x_1 x_2}(x_i, x_j) (f*_{x_i}/√P_{x_1}(x_i)) (f*_{x_j}/√P_{x_2}(x_j)) + Σ_j f*_{x_j}² )
= (1/2) ( Σ_i P_{x_1}(x_i) (f*_{x_i}/√P_{x_1}(x_i))² - 2 Σ_{i,j} P_{x_1 x_2}(x_i, x_j) (f*_{x_i}/√P_{x_1}(x_i)) (f*_{x_j}/√P_{x_2}(x_j)) + Σ_j P_{x_2}(x_j) (f*_{x_j}/√P_{x_2}(x_j))² )
= (1/2) Σ_i Σ_j P_{x_1 x_2}(x_i, x_j) ( f*_{x_i}/√P_{x_1}(x_i) - f*_{x_j}/√P_{x_2}(x_j) )²,   (15)

where f*_x = f*(x), P_{x_1}(x_i) = Σ_j P_{x_1 x_2}(x_i, x_j) = 1/N and P_{x_2}(x_j) = Σ_i P_{x_1 x_2}(x_i, x_j) = 1/N. Note that we only sample a pair (x_i, x_j) where x_i is created by semantic-preserving perturbation of x_j to train the model; the probability mass P_{x_1 x_2}(x_i, x_j) > 0 only if (x_i, x_j) are generated from a shared latent variable. Let (x, x⁺) denote a positive pair created by semantic-preserving perturbation. We can rewrite Eq. (15) as

f*⊤ Ā f* = (N/2) E_{x, x⁺}[(f*_x - f*_{x⁺})²],   (16)

where E_{x, x⁺}[(f*_x - f*_{x⁺})²] quantifies the probability that the optimal classifier f*(•) predicts different labels for (x, x⁺). When f*_x ≠ f*_{x⁺}, there must be f*(x) ≠ y(x) or f*(x⁺) ≠ y(x). With Assumption 4.2 and a union bound, we have Pr(f*(x) ≠ f*(x⁺)) ≤ 2α. As such, the quadratic form in Eq. (16) can be upper bounded:

f*⊤ Ā f* ≤ N α.   (17)

Combining Eq. (14) and Eq. (17), we have

∥f* - ZB∥² = ∥Pf*∥² ≤ N α / γ_{k+1},   (18)
(1/N) ∥f* - ZB∥² ≤ α / γ_{k+1}.   (19)

The k-th singular value of A is λ_k = 1 - γ_k, which is also the k-th Hirschfeld-Gebelein-Rényi maximal correlation by definition. Therefore, we have Pr_{x∼P(x)}[y(x) ≠ f_B(x)] ≤ O(α/(1 - λ_{k+1})). The second term on the right-hand side of the main theorem is the estimation error, which measures the sample complexity of learning B with access to n_2 i.i.d.
training samples in the downstream task. It can be upper bounded using the Rademacher complexity of linear models. Let $\mathcal{H}_1 = \{z \mapsto z^\top B : \lVert B\rVert_F \le C_b\}$. The Rademacher complexity of this linear class is

$$
\mathcal{R}_{n_2}(\mathcal{H}_1) = \frac{C_b\sqrt{C_z}}{\sqrt{n_2}}, \qquad (20)
$$

where $\mathbb{E}[\lVert z\rVert^2] \le C_z$. Combining the approximation and estimation errors yields the bound in the main theorem:

$$
\Pr_{x\sim P(x)}\big[y(x) \ne f_B(x)\big] \le O\Big(\frac{\alpha}{1-\lambda_{k+1}} + \sqrt{\frac{k}{n_2}}\Big).
$$

B IMAGE DATASETS

CIFAR-10/100 are two datasets of tiny natural images with a size of 32 × 32 (Krizhevsky, 2009). CIFAR-10 and CIFAR-100 have 10 and 100 classes, respectively. Both datasets contain 50,000 training images and 10,000 test images. STL-10 is a 10-class image recognition dataset for unsupervised learning (Coates et al., 2011). Each class contains 500 labeled training images and 800 test images. In addition, the dataset contains 100,000 unlabeled training images. Both labeled and unlabeled training images are used for feature extractor pretraining without using labels. The linear classifier is learned using the labeled training images. Tiny-ImageNet is a subset of the ILSVRC-2012 classification dataset (Le & Yang, 2015).

C ADDITIONAL RESULTS

To understand the effectiveness of adversarial perturbations within APLR, we perform several additional experiments. First, we compare perturbation by adversarial noise against perturbation by Gaussian noise and random masking. For image datasets, we additionally compare the proposed adversarial perturbations against common image augmentations used in supervised learning, including CutMix (Yun et al., 2019 ), RandAugment (Cubuk et al., 2019) , and Random Erasing (Zhong et al., 2017) . Next, we explore the sensitivity of APLR to different perturbation strengths and Lagrange multipliers. Lastly, we compare our framework against SOTA SSL methods on image data.

C.1 ABLATION STUDY

First, for all datasets, we perform ablations to compare perturbation with adversarial noise against Gaussian noise and masking. To obtain a sample augmented with Gaussian noise, we use x₁ = x₂ + δ, where δ ∼ N(0, σ²I). For each dataset, we search the standard deviation σ over the set {1, 3, 5, 10} and report the best linear evaluation accuracy. For experiments with masking, we randomly mask a proportion of the clean sample x₂. We search the masking proportion over the set {20%, 40%, 50%, 60%, 70%} and report the best linear evaluation accuracy. Tables 4-6 summarize the results. Adversarial noise outperforms Gaussian noise and random masking on all datasets except MNIST. Random noise may be less effective because it perturbs features uniformly, including uninformative ones, which does not serve the intended goal of augmentation; this is why APLR yields significant improvements over random perturbations on complex data such as images and audio. Additionally, we compare adversarial noise against image augmentations from supervised learning, namely CutMix (Yun et al., 2019), RandAugment (Cubuk et al., 2019), and Random Erasing (Zhong et al., 2017). The results are summarized in Table 7. Random Erasing yields the worst performance among all methods, while CutMix is on par with Mixup in SSL; this is expected because CutMix performs slightly better than or similar to Mixup in supervised learning. RandAugment outperforms CutMix and Mixup because it contains a wide range of image augmentations, but it does not outperform SimCLR. The studies in SimCLR show that careful selection of image augmentations is necessary for good performance in SSL; some image augmentations that are effective in supervised learning do not lead to good performance in SSL.
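The two random baselines above can be sketched in a few lines. The snippet below is our illustration (function names and the zero-filling choice for masked entries are ours, not the paper's code):

```python
import numpy as np

def gaussian_augment(x, sigma, rng):
    """Gaussian-noise baseline: x1 = x2 + delta, delta ~ N(0, sigma^2 I)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def mask_augment(x, p, rng):
    """Masking baseline: randomly zero out a proportion p of the features."""
    keep = rng.random(x.shape) >= p
    return x * keep

rng = np.random.default_rng(0)
x = rng.standard_normal(784)  # e.g. a flattened 28x28 image
x_gauss = gaussian_augment(x, sigma=1.0, rng=rng)  # sigma searched over {1,3,5,10}
x_mask = mask_augment(x, p=0.5, rng=rng)           # p searched over {20%,...,70%}
assert x_gauss.shape == x.shape and x_mask.shape == x.shape
```

Both baselines are domain-agnostic: neither requires knowledge of the data type beyond its shape, which is exactly the setting in which they are compared against APLR's learned adversarial noise.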

C.2 SENSITIVITY ANALYSIS

We perform experiments to understand how sensitive APLR is to the strength of the adversarial perturbation and to the Lagrange multiplier. We experiment with perturbation strengths ϵ ∈ {0.05, 0.1, 0.15} and report the results in Table 8; the results indicate that our method is robust to the perturbation strength. For the Lagrange multiplier, we experiment with γ ∈ {0.1, 0.5, 1.0} and report the results in Table 9. We find that our method is likewise robust to γ, and we selected γ = 1 as the default value since it performed well consistently. Overall, our sensitivity analyses indicate that APLR achieves strong performance as long as the hyperparameter values are within reasonable ranges.

C.3 APLR AGAINST SOTA IMAGE-SPECIFIC SSL METHODS

We perform an analysis to compare the proposed framework against SOTA SSL methods on images, namely SimCLR (Chen et al., 2020) , Barlow Twins (Zbontar et al., 2021) , and BYOL (Grill et al., 2020) . For this experiment, we use the image augmentations described in SimCLR (Chen et al., 2020) for a fair comparison against image-specific SSL methods. We train each model for 200 epochs and summarize the results in Table 10 . Our method achieves comparable performance to BYOL and Barlow Twins. 



For experiments with VIME-Self, we align the experimental setup with APLR, using the same backbone architecture and pretraining on the entire training set. DACL reports better results with larger backbones and more training epochs; we report only the results reproduced under our experimental settings. Following the experimental setup in Tamkin et al. (2021), the training set in LibriSpeech has 251 classes and the testing set has classes. Since we train the fully supervised model end-to-end without a linear evaluation phase, we cannot report a supervised result for LibriSpeech-100.



Figure1: Illustrative diagram for adversarial perturbation based latent reconstruction. The adversarial generator network takes a clean sample as input and outputs δ, an adversarial perturbation of the same shape. We constrain the perturbation by its ℓ p norm and control its strength with the perturbation budget ϵ. This constrained perturbation allows perturbed samples to appear unchanged to a human evaluator, making the perturbations semantic-preserving.
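The ℓp constraint on the generator output described in the caption can be enforced in several standard ways. The sketch below is our illustration for the common p = ∞ case (the paper's generator may realize the budget differently): hard projection by clipping, or smooth squashing with tanh.

```python
import numpy as np

def project_linf(delta, eps):
    """Hard projection of a raw perturbation onto the l-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

def squash_linf(delta, eps):
    """Smooth alternative: bound the perturbation via tanh scaling."""
    return eps * np.tanh(delta)

rng = np.random.default_rng(0)
raw = 3.0 * rng.standard_normal((32, 32))  # raw generator output, unconstrained
for delta in (project_linf(raw, 0.1), squash_linf(raw, 0.1)):
    # every entry respects the perturbation budget epsilon
    assert np.max(np.abs(delta)) <= 0.1 + 1e-12
```

A small budget ϵ is what keeps the perturbed sample perceptually unchanged, which is the semantic-preserving property the caption relies on.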

MNIST/Fashion-MNIST are two image datasets of handwritten digits and Zalando's article images, respectively (LeCun et al., 2010; Xiao et al., 2017). The images of size 28 × 28 are flattened into vectors with 784 features. Both datasets have 10 classes, and contain 60,000 training examples and 10,000 test examples.

Table 1: Linear evaluation accuracy on tabular data.

Table 2: Linear evaluation accuracy on image data.

We use three benchmark audio datasets to evaluate APLR and describe the dataset details below. ESC-10/50 are two environmental sound classification datasets containing 5-second environmental recordings (Piczak, 2015). ESC-10 and ESC-50 have 10 and 50 classes, and contain 400 and 2,000 examples, respectively. We use the original fold settings from the authors (Piczak, 2015) and follow the experimental setup in Al-Tahan & Mohsenzadeh (2021), using the first fold for testing and the rest for training.

We follow Tamkin et al. (2021) to pretrain on the LibriSpeech-100 hour corpus, which contains 28,539 examples, and perform linear evaluation on the LibriSpeech development set, which contains 2,703 examples. Augmentations for audio data are relatively underexplored. Our results demonstrate the advantage of learning audio augmentations over manually designing them. Our proposed method also outperforms both domain-agnostic methods, DACL and Viewmaker. DACL performs close to APLR on the simple yet small ESC-10 and ESC-50 datasets, but it is unable to learn effective representations on the larger and significantly more complex LibriSpeech-100 dataset. Even though both APLR and Viewmaker use adversarial noise, APLR outperforms Viewmaker by a large margin across all benchmark datasets. This indicates the effectiveness of learning augmentations by maximizing the discrepancy between latent representations. In Table 3, we also report results from training the full architecture in a supervised manner. We find that linear classifiers trained on top of the representations learned by APLR outperform the supervised model on ESC-10 and close the gap on ESC-50 relative to other benchmarks, demonstrating the ability of APLR to learn useful latent representations. Current state-of-the-art supervised approaches report high accuracies (over 94%) on the ESC-50 dataset (Gong et al., 2021; Kumar & Ithapu, 2020). However, these methods perform pretraining on large datasets such as AudioSet and ImageNet, and use multiple audio-specific data augmentations. In our supervised training experiments, we do not pretrain on large datasets, and we use only time masking and frequency masking as augmentations. Our goal is simply to compare APLR against training the same architecture in a supervised manner.

E[∥z∥²] ≤ C_z: by the definition in Eq. 3, E[∥z∥²] captures the summation of the first k HGR maximal correlations, and E[∥z∥²] ≤ k because each HGR maximal correlation is at most 1.

Tiny-ImageNet consists of 200 classes, with 500 training images, 50 validation images, and 50 test images per class. Each image has a size of 64 × 64.

Table 4: Ablation study on tabular data.

Table 5: Ablation study on image data.

Table 6: Ablation study on audio data.

Table 7: Additional ablation study on image data.

Table 8: Sensitivity to adversarial perturbation strengths (columns: ϵ = 0.05, ϵ = 0.1, ϵ = 0.15).

Table 9: Sensitivity to the Lagrange multiplier γ.

Table 10: APLR vs. SOTA SSL methods on image data.

D VISUALIZATIONS OF ORIGINAL AND PERTURBED SPECTROGRAMS

In Figure 3, we visualize random spectrograms from LibriSpeech-100 and the deltas between the original and perturbed spectrograms. The perturbations are visually indistinguishable and thus semantic-preserving.

Figure 3: Example triplets of original spectrograms (left), perturbed spectrograms (middle), and their differences (right) from LibriSpeech-100. The color scales for original and perturbed spectrograms are set to the scale of the original spectrogram. The color scale for the differences is set to -2.5 (red) to +2.5 (blue), though some values exceed this range. Best viewed when zoomed.

E ADVERSARIAL GENERATOR ARCHITECTURE

The architecture of the generator is described in Table 11 . For experiments on tabular data, we replace the convolution layers with fully connected layers. 

F VISUALIZATIONS OF TRAINING OBJECTIVES DURING TRAINING

To understand how well the final objective in Equation 4 approximates the HGR correlation, we plot the two loss terms over training in Figure 4 . We find that the orthogonal loss approaches zero during training. 
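An orthogonal loss of the kind tracked above can be realized as a penalty pushing the batch second-moment matrix of the representations toward the identity, i.e., the whitening constraint underlying the HGR interpretation of latent reconstruction. The sketch below is our formulation under that assumption, not necessarily the paper's exact term from Equation 4:

```python
import numpy as np

def orthogonal_loss(z):
    """Frobenius penalty || Z^T Z / n - I ||_F^2 on a batch of n representations."""
    n, k = z.shape
    cov = z.T @ z / n
    return np.sum((cov - np.eye(k)) ** 2)

rng = np.random.default_rng(0)
z_random = rng.standard_normal((4096, 8))        # only approximately whitened
q, _ = np.linalg.qr(rng.standard_normal((4096, 8)))
z_white = q * np.sqrt(4096)                      # exactly whitened: Z^T Z / n = I
assert orthogonal_loss(z_white) < 1e-12          # penalty vanishes when whitened
assert orthogonal_loss(z_random) > orthogonal_loss(z_white)
```

Under this reading, the observed decay of the orthogonal loss toward zero indicates that the learned representation dimensions become decorrelated and unit-variance over training.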

