ADVERSARIAL PERTURBATION BASED LATENT RECONSTRUCTION FOR DOMAIN-AGNOSTIC SELF-SUPERVISED LEARNING

Abstract

Most self-supervised learning (SSL) methods rely on domain-specific pretext tasks and data augmentations to learn high-quality representations from unlabeled data. Developing such pretext tasks and data augmentations requires expert domain knowledge. In addition, it is not clear why solving certain pretext tasks leads to useful representations. These two limitations hinder the wider application of SSL to different domains. To overcome them, we propose adversarial perturbation based latent reconstruction (APLR) for domain-agnostic self-supervised learning. In APLR, a neural network is trained to generate adversarial noise that perturbs the unlabeled training samples, so that domain-specific augmentations are not required. The pretext task in APLR is to reconstruct the latent representation of a clean sample from a perturbed sample. We show that representation learning via latent reconstruction is closely related to the multi-dimensional Hirschfeld-Gebelein-Rényi (HGR) maximal correlation and has theoretical guarantees on the linear probe error. To demonstrate the effectiveness of APLR, we apply the proposed method to various domains such as tabular data, images, and audio. Empirical results indicate that APLR not only outperforms existing domain-agnostic SSL methods, but also closes the performance gap to domain-specific SSL methods. In many cases, APLR even outperforms training the full network in a supervised manner.

1. INTRODUCTION

Unsupervised deep learning has been highly successful in discovering useful representations in natural language processing (NLP) (Devlin et al., 2019; Brown et al., 2020) and computer vision (Chen et al., 2020; He et al., 2020). These methods define pretext tasks on unlabeled data so that representation learning can be done in a self-supervised manner without explicit human annotations. The success of self-supervised learning (SSL) depends on domain-specific pretext tasks as well as domain-specific data augmentations. However, the development of semantic-preserving data augmentations requires expert domain knowledge, and such knowledge may not be readily available for certain data types such as tabular data (Ucar et al., 2021). Furthermore, the theoretical understanding of why certain pretext tasks lead to useful representations remains fairly elusive (Tian et al., 2021). These two reasons hinder wider application of SSL beyond the fields of NLP and computer vision. Self-supervised algorithms benefit from the inductive biases of domain-specific designs, but they do not generalize across different domains. For example, masked language models like BERT (Devlin et al., 2019) are not directly applicable to untokenized data. Although contrastive learning does not require tokenized data, its success in computer vision cannot be easily leveraged in other domains due to its sensitivity to image-specific data augmentations (Chen et al., 2020). Moreover, in contrastive learning, the quality of representations degrades significantly without those hand-crafted data augmentations (Grill et al., 2020). Inspired by denoising auto-encoding (Vincent et al., 2008; 2010; Pathak et al., 2016), perturbation of natural samples with Gaussian, Bernoulli, and mixup noise (Verma et al., 2021; Yoon et al., 2020) has been utilized as a domain-agnostic data augmentation for self-supervised representation learning on images, graphs, and tabular data.
However, random noise may not be as effective, since uniformly perturbing uninformative features does not serve the intended goal of augmentation. In particular, the convex combinations used in mixup noise (Zhang et al., 2018; Yun et al., 2019) can generate out-of-distribution samples, because there is no guarantee that the input data space is convex (Ucar et al., 2021). In this article, we use generative adversarial perturbation as a semantic-preserving data augmentation method (Baluja & Fischer, 2018; Poursaeed et al., 2018; Naseer et al., 2019; Nakka & Salzmann, 2021) applicable to different domains of data. Adversarial perturbation is constrained by the ℓ_p norm distance to the natural sample so that it is semantic-preserving and does not change the label (Goodfellow et al., 2015; Madry et al., 2018). With semantic-preserving perturbation, the pretext task in domain-agnostic SSL could be reconstruction of clean samples (Yoon et al., 2020) or instance discrimination of perturbed samples (Verma et al., 2021). Nevertheless, reconstruction of clean samples in the input space is computationally expensive because the input data dimension is high. Therefore, we present adversarial perturbation based latent reconstruction (APLR), a simple and intuitive domain-agnostic self-supervised pretext task derived from linear generative models, to learn representations from unlabeled data in a domain-agnostic manner. Contrary to the pretext task of instance discrimination, our method does not require comparison to a large number of negative samples to achieve good performance. The proposed APLR not only achieves strong empirical performance on SSL in various domains, but also has theoretical guarantees on the linear probe error on downstream tasks. The contributions of this article are summarized below:
• We present adversarial perturbation based latent reconstruction for domain-agnostic self-supervised learning.
• The proposed APLR achieves strong linear probe performance on various data types without using domain-specific data augmentations.
• We provide theoretical guarantees on the linear probe error on downstream tasks.
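To make the pretext task concrete, the following is a minimal NumPy sketch of one APLR-style training objective: a sample is perturbed within an ℓ∞ ball, and the loss asks a predictor to reconstruct the clean sample's latent representation from the perturbed one. All names here are illustrative; the actual method uses neural networks for the encoder and a trained generator (rather than random noise) for the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Linear stand-in for the encoder network: f(x) = W x."""
    return W @ x

def project_linf(delta, eps):
    """Constrain the perturbation to the l-infinity ball of radius eps,
    keeping it small enough to be semantic-preserving."""
    return np.clip(delta, -eps, eps)

def latent_reconstruction_loss(x, delta, W, P):
    """Reconstruct the latent of the clean sample from the perturbed one:
    || f(x) - P f(x + delta) ||^2, with a linear predictor head P."""
    z_clean = encode(x, W)
    z_pert = encode(x + delta, W)
    return float(np.sum((z_clean - P @ z_pert) ** 2))

d, k, eps = 8, 4, 0.1              # input dim, latent dim, perturbation budget
x = rng.normal(size=d)             # one unlabeled training sample
W = rng.normal(size=(k, d))        # encoder weights
P = np.eye(k)                      # identity predictor for this sketch
delta = project_linf(rng.normal(size=d), eps)

loss = latent_reconstruction_loss(x, delta, W, P)
print(loss)
```

In the full method, the perturbation generator is trained adversarially to maximize this loss while the encoder and predictor are trained to minimize it; the sketch above only evaluates the objective for a fixed random perturbation.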

2. BACKGROUND

Learning representations from two views of an input, $x_1$ and $x_2$, is appealing if the learned representations do not contain the noises specific to each view. This assumption can be explicitly encoded into the following generative model (Bach & Jordan, 2005) with one shared latent variable $z$:
$$p(z) = \mathcal{N}(0, I), \quad p(x_1 \mid z) = \mathcal{N}(\psi^\top z, \Sigma_1), \quad p(x_2 \mid z) = \mathcal{N}(\eta^\top z, \Sigma_2),$$
where the model parameters $\psi$, $\eta$, $\Sigma_1$, and $\Sigma_2$ can be learned by maximum likelihood estimation. Reconstruction of the input data via maximum likelihood estimation is computationally expensive when the dimension of the input data is high. Instead, the probabilistic generative model can be reinterpreted as latent reconstruction, which has the benefit of direct representation learning while retaining the properties of generative modeling. To convert generative modeling into latent reconstruction, two assumptions need to be met. First, generative modeling assumes that both datasets have similar low-rank approximations. In latent reconstruction, this can be achieved by correlating the pair of latent embeddings $\psi x_1$ and $\eta x_2$. Second, generative modeling assumes that the latent variables follow an isotropic Gaussian distribution $p(z) = \mathcal{N}(0, I)$. In latent reconstruction, the covariance matrix of the latent variables is diagonal, meaning that there is no covariance between different dimensions of the latent variable. This orthogonality constraint is equivalent to the assumption of an isotropic Gaussian prior and avoids the trivial solution. As a result, by correlating the latent embeddings $\psi x_1$ and $\eta x_2$ and enforcing a diagonal covariance matrix, the properties of generative modeling can be retained in latent reconstruction. The key principle behind latent reconstruction is that the latent representation of $x_1$ is a good predictor of that of $x_2$.
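The shared-latent assumption above can be checked numerically: sampling two views from the generative model and projecting each onto its own loadings yields latent scores that are strongly correlated, because both scores track the shared variable $z$. The following is a small NumPy sketch (one latent dimension for simplicity; the noise scale and dimensions are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, d = 5000, 1, 6               # samples, latent dim, observed dim

# Shared latent variable z ~ N(0, I)
z = rng.normal(size=(n, k))

# Two views generated from the same z with view-specific loadings and noise
psi = rng.normal(size=(k, d))
eta = rng.normal(size=(k, d))
x1 = z @ psi + 0.3 * rng.normal(size=(n, d))
x2 = z @ eta + 0.3 * rng.normal(size=(n, d))

# Latent scores psi x1 and eta x2 (k = 1, so each score is a scalar per sample)
s1 = (x1 @ psi.T).ravel()
s2 = (x2 @ eta.T).ravel()

# Scores from the two views correlate strongly through the shared z
corr = np.corrcoef(s1, s2)[0, 1]
print(round(corr, 2))
```

With view-specific noise much smaller than the shared signal, the correlation is close to 1; as the noise grows, the correlation drops, which is exactly why maximizing the correlation of latent scores recovers the shared structure.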
Given two datasets $X_1$ and $X_2$ of $N$ observations, the projection directions are found by maximizing the regularized correlation between the latent scores of $x_1$ and $x_2$:
$$\max_{\psi_i, \eta_i} \frac{\operatorname{Cov}(X_1 \psi_i, X_2 \eta_i)^2}{\left(\gamma + (1 - \gamma)\operatorname{Var}(X_1 \psi_i)\right)\left(\gamma + (1 - \gamma)\operatorname{Var}(X_2 \eta_i)\right)},$$
where $\gamma \in [0, 1]$ interpolates between canonical correlation analysis ($\gamma = 0$) and partial least squares ($\gamma = 1$).
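This objective is straightforward to evaluate for candidate directions. Below is a minimal NumPy sketch (function and variable names are illustrative, not from the paper): at $\gamma = 0$ the value is a squared correlation and therefore lies in $[0, 1]$, while at $\gamma = 1$ the denominator vanishes into a constant and only the squared covariance remains.

```python
import numpy as np

def regularized_corr(X1, X2, psi_i, eta_i, gamma):
    """Objective from the text: squared covariance of the latent scores,
    normalized by variance terms regularized by gamma in [0, 1]."""
    s1 = X1 @ psi_i
    s2 = X2 @ eta_i
    cov = np.mean((s1 - s1.mean()) * (s2 - s2.mean()))
    den = (gamma + (1 - gamma) * s1.var()) * (gamma + (1 - gamma) * s2.var())
    return cov ** 2 / den

rng = np.random.default_rng(2)
n, d = 1000, 4
z = rng.normal(size=n)                                  # shared latent signal
X1 = np.outer(z, rng.normal(size=d)) + 0.1 * rng.normal(size=(n, d))
X2 = np.outer(z, rng.normal(size=d)) + 0.1 * rng.normal(size=(n, d))
psi_i = rng.normal(size=d)
eta_i = rng.normal(size=d)

# gamma = 0 gives a squared correlation, so the value lies in [0, 1]
val = regularized_corr(X1, X2, psi_i, eta_i, gamma=0.0)
print(0.0 <= val <= 1.0)
```

Note that at $\gamma = 0$ the objective is invariant to rescaling of $\psi_i$ or $\eta_i$, which is why the orthogonality (diagonal covariance) constraint is needed to pin down the solution and avoid collapse.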

