STRUCTURAL ADVERSARIAL OBJECTIVES FOR SELF-SUPERVISED REPRESENTATION LEARNING

Abstract

Within the framework of generative adversarial networks (GANs), we propose objectives that task the discriminator with additional structural modeling responsibilities. In combination with an efficient smoothness regularizer imposed on the network, these objectives guide the discriminator to learn to extract informative representations, while maintaining a generator capable of sampling from the domain. Specifically, we influence the features produced by the discriminator at two levels of granularity. At a coarse scale, we impose a Gaussian assumption encouraging smoothness and a diversified representation, while at a finer scale, we group features into local clusters. Experiments demonstrate that augmenting GANs with these self-supervised objectives suffices to produce discriminators which, evaluated in terms of representation learning, compete with networks trained by state-of-the-art contrastive approaches. Furthermore, operating within the GAN framework frees our system from the reliance on data augmentation schemes that is prevalent across purely contrastive representation learning methods.

1. INTRODUCTION

Unsupervised feature learning algorithms aim to directly learn representations from data without reliance on annotations, and have become crucial to efforts to scale vision and language models toward handling real-world complexity. Many state-of-the-art approaches adopt a contrastive self-supervised framework, wherein a deep neural network is tasked with mapping augmented views of a single example to nearby positions in a high-dimensional embedding space, while separating embeddings of different examples (Wu et al., 2018; He et al., 2020; Chen et al., 2020; Chen & He, 2021; Grill et al., 2020; Zbontar et al., 2021). Though requiring no annotation, and hence unaffected by assumptions baked into any labeling procedure, the invariances learned by these models are still influenced by human-designed heuristic procedures for creating augmented views. The recent prominence of contrastive approaches was both preceded by and continues alongside a focus on engineering domain-relevant proxy tasks for self-supervised learning. For computer vision, examples include learning geometric layout (Doersch et al., 2015), colorization (Zhang et al., 2016; Larsson et al., 2017), and inpainting (Pathak et al., 2016; He et al., 2022). Basing task design on domain knowledge may prove effective in increasing learning efficiency, but strays further from an alternative goal of developing truly general and widely applicable unsupervised learning techniques. A third family of approaches, coupling data generation with representation learning, may provide a path toward such generality while also escaping dependence upon the hand-crafted elements guiding data augmentation or proxy task design. Generative adversarial networks (GANs) (Goodfellow et al., 2020) and variational autoencoders (VAEs) (Kingma & Welling, 2013) are prime examples within this family.
Considering GANs, one might expect the discriminator to act as an unsupervised representation learner, driven by the need to model the real data distribution in order to score the generator's output. Indeed, prior work finds that some degree of representation learning occurs within discriminators in a standard GAN framework (Radford et al., 2015). Yet, to improve generator output quality, limiting the capacity of the discriminator appears advantageous (Arjovsky et al., 2017), a choice potentially in conflict with representation learning. Augmenting the standard GAN framework to separate encoding and discrimination responsibility into different components (Donahue et al., 2017; Dumoulin et al., 2017), along with scaling to larger models (Donahue & Simonyan, 2019), is a promising path to circumventing this apparent limitation. However, it has been unclear whether the struggle to utilize vanilla GANs as effective representation learners stems from inherent limitations of the framework. We provide evidence to the contrary, through an approach that significantly improves representations learned by the discriminator, while maintaining generation quality and operating with a standard pairing of generator and discriminator components. Our approach only modifies the training objectives within the GAN framework, with two aims: (1) regularize the smoothness of the discriminator while maintaining adequate capacity, and (2) require the discriminator to model additional structure of the real data distribution. To control discriminator smoothness, we propose an efficient regularization scheme that approximates the spectral norm of the Jacobian. Specifically, it instantiates the matrix-vector subroutine in a power iteration with an efficient vector-Jacobian product protocol, avoiding the need to compute the full Jacobian matrix of the neural network.
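As a concrete illustration of the power-iteration idea (a minimal sketch, not the authors' implementation), the spectral norm of a Jacobian J can be estimated using only matrix-vector products, never materializing J itself. In a real network, the products Jv and J^T u would come from forward- and reverse-mode automatic differentiation; here we stand in for those autodiff calls with a linear map f(x) = Wx, whose Jacobian is simply W. The function and variable names below are our own choices for exposition.

```python
import numpy as np

def spectral_norm_power_iter(jvp, vjp, out_dim, n_iters=100, seed=0):
    """Estimate the spectral norm (top singular value) of a Jacobian J
    given only black-box products v -> J v and u -> J^T u."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=out_dim)
    u /= np.linalg.norm(u)
    sigma = 0.0
    for _ in range(n_iters):
        v = vjp(u)              # v = J^T u  (vector-Jacobian product)
        v /= np.linalg.norm(v)
        u = jvp(v)              # u = J v    (Jacobian-vector product)
        sigma = np.linalg.norm(u)
        u /= sigma
    return sigma

# For the linear map f(x) = W x, the Jacobian is W, so both products
# reduce to ordinary matrix multiplies (stand-ins for autodiff calls).
W = np.random.default_rng(1).normal(size=(5, 8))
sigma = spectral_norm_power_iter(lambda v: W @ v, lambda u: W.T @ u, out_dim=5)
```

Each iteration costs two matrix-vector products, so the estimate scales to networks where forming the full Jacobian would be prohibitive.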
To prompt the discriminator to learn embeddings that model the data distribution, we propose adversarial objectives resembling a contrastive self-supervised clustering target. These objectives influence the output of the discriminator at two levels of granularity. At a coarse level, we make a Gaussian assumption aiming for diversified and smooth representation. At a more refined scale, we adopt a mean shift clustering objective to group features. Though these objectives impose attractive and repulsive forces on embeddings produced by the discriminator (Figure 1), we do not rely on data augmentation to drive learning. Instead, both real and fake data participate in an adversarial clustering game. Experiments on representation learning benchmarks show our method achieves competitive performance with recent state-of-the-art contrastive self-supervised learning approaches, even though we do not leverage information from (or even have a concept of) an augmented view. We also demonstrate that supplementing a GAN with our proposed objectives not only enhances the discriminator as a representation learner, but also improves the quality of samples produced by the generator.
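The coarse-scale Gaussian assumption can be sketched as follows (an illustrative toy, with hypothetical names of our own choosing, not the paper's exact loss): fit a Gaussian with sample mean and covariance to real embeddings, then score fake embeddings by their Mahalanobis distance to that Gaussian. In an adversarial setup, the generator would be trained to decrease this score while the discriminator shapes the real-embedding distribution.

```python
import numpy as np

def mahalanobis_score(real_emb, fake_emb, eps=1e-5):
    """Fit a Gaussian (sample mean, covariance) to real embeddings and
    score fake embeddings by mean squared Mahalanobis distance to it."""
    mu = real_emb.mean(axis=0)
    centered = real_emb - mu
    cov = centered.T @ centered / (len(real_emb) - 1)
    cov += eps * np.eye(cov.shape[0])   # keep the covariance invertible
    inv_cov = np.linalg.inv(cov)
    diff = fake_emb - mu
    return np.einsum('ni,ij,nj->n', diff, inv_cov, diff).mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
near = rng.normal(size=(200, 4))        # fakes matching the real Gaussian
far = near + 5.0                        # fakes far off-distribution
score_near = mahalanobis_score(real, near)
score_far = mahalanobis_score(real, far)
```

Embeddings drawn from the same distribution as the real data receive a low score, while shifted embeddings are penalized heavily, which is the directional signal an adversarial alignment objective needs.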

2. RELATED WORK

2.1. GENERATIVE FEATURE LEARNING

Much research on GANs has focused on improving the quality of generated data, yielding significant recent advances (Karras et al., 2017; 2019; 2020; 2021; Sauer et al., 2022). Other efforts have focused on evolving capabilities, including conditional and controllable generation, e.g., text to im-

Figure 1: Within the GAN framework, we illustrate our proposed structural adversarial objectives, which push the discriminator to learn good feature representations. We combine objectives from two levels of granularity. Top right: At a coarse scale, we model features under the Gaussian assumption, and adversarially orient generated embeddings (orange) toward the real embeddings (blue), using the sample mean µ and covariance Σ. Bottom right: At a finer scale, we implement clustering objectives, which group adjacent real embeddings to form a cluster and adversarially attract the generated embedding toward the nearby cluster center. Bottom left: Learned representations on CIFAR-10 show a distribution consistent with categorization.
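The finer-scale grouping of adjacent real embeddings into clusters can be illustrated with a single mean shift step under a Gaussian kernel (a toy sketch; the function name and the `bandwidth` hyperparameter are our own illustrative choices): each embedding moves toward the kernel-weighted mean of its neighbors, which plays the role of a local cluster center.

```python
import numpy as np

def mean_shift_step(emb, bandwidth=1.0):
    """One mean shift update: move each embedding toward the
    kernel-weighted mean of all embeddings (its local cluster center)."""
    sq_dists = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    weights /= weights.sum(axis=1, keepdims=True)   # row-normalize kernel
    return weights @ emb

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.3, size=(50, 2))   # one tight group of embeddings
blob_b = rng.normal(8.0, 0.3, size=(50, 2))   # a second, well-separated group
emb = np.vstack([blob_a, blob_b])
shifted = mean_shift_step(emb)
```

Because the kernel weight between the two distant groups is negligible, each group contracts toward its own center while the groups stay apart; in the adversarial setting, generated embeddings would then be attracted toward these nearby centers.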

