UNSUPERVISED LEARNING OF STRUCTURED REPRESENTATIONS VIA CLOSED-LOOP TRANSCRIPTION

Abstract

This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes. While most existing unsupervised learning approaches focus on a representation for only one of these two goals, we show that a unified representation can enjoy the mutual benefits of having both. Such a representation is attainable by generalizing the recently proposed closed-loop transcription framework, known as CTRL, to the unsupervised setting. This entails solving a constrained maximin game over a rate reduction objective that expands features of all samples while compressing features of augmentations of each sample. Through this process, we see discriminative low-dimensional structures emerge in the resulting representations. Under comparable experimental conditions and network complexities, we demonstrate that these structured representations enable classification performance close to that of state-of-the-art unsupervised discriminative representations, and conditional image generation quality significantly higher than that of state-of-the-art unsupervised generative models.

1. INTRODUCTION

In the past decade, we have witnessed explosive development in the practice of machine learning, particularly with deep learning methods. A key driver of success in practical applications has been tremendous engineering effort, often focused on fitting increasingly large deep networks to input data paired with task-specific sets of labels. Brute-force approaches of this nature, however, place heavy demands on hand-labeled data for supervision and on computational resources for training and inference. As a result, increasing attention has been directed toward self-supervised or unsupervised techniques that learn representations which not only require no human annotation effort, but can also be shared across downstream tasks.

Discriminative versus Generative. Tasks in unsupervised learning are typically separated into two categories. Discriminative tasks frame high-dimensional observations as inputs from which low-dimensional class or latent information can be extracted, while generative tasks frame observations as generated outputs, often to be sampled given some semantically meaningful conditioning. Unsupervised learning approaches targeted at discriminative tasks are mainly based on one key idea: pull different views of the same instance closer while preventing representation collapse, either via contrastive learning techniques (Chen et al., 2020b; He et al., 2020; Grill et al., 2020a), covariance regularization methods (Bardes et al., 2021; Zbontar et al., 2021), or architecture design (Chen & He, 2020; Grill et al., 2020b). Their success is typically measured by the accuracy of a simple classifier (say, a shallow network) trained on the representations they produce, which has progressively improved over the years. Representations learned from these approaches, however, capture little of the intrinsic structure of the data distribution, and have not demonstrated success for generative purposes.
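The "pull views of the same instance closer while avoiding collapse" idea behind the contrastive family can be sketched as a minimal InfoNCE-style loss in NumPy. This is an illustration of the general principle only, not the implementation of any of the cited methods; the function name and temperature value are our own choices:

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """Minimal InfoNCE-style loss: row i of z1 is pulled toward the
    matching row i of z2 (two augmented views of the same instance)
    and pushed away from all other rows (views of other instances)."""
    # Cosine similarity via L2-normalized features.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                         # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; minimize their negative log-likelihood.
    return -np.mean(np.diag(log_probs))
```

Aligned view pairs yield a lower loss than mismatched ones, which is exactly the signal that pulls same-instance features together.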
In parallel, generative methods such as GANs (Goodfellow et al., 2014) and VAEs (Kingma & Welling, 2013) have also been explored for unsupervised learning. Although generative methods have made striking progress in the quality of sampled or autoencoded data, the representations they learn perform markedly worse in classification than those of the aforementioned discriminative methods.

Toward A Unified Representation? The disparity between discriminative and generative approaches in unsupervised learning, set against the fundamental goal of learning representations that are useful across many tasks, leads to the natural question we investigate in this paper: in the unsupervised setting, is it possible to learn a unified representation that is effective for both discriminative and generative purposes? Further, do the two purposes mutually benefit each other? Concretely, we aim to learn a structured representation with the following two properties:

1. The learned representation should be discriminative, such that simple classifiers applied to the learned features yield high classification accuracy.

2. The learned representation should be generative, with enough diversity to recover the raw inputs, and with structure that can be exploited for sampling and generating new images.

The fact that human visual memory serves both discriminative tasks (for example, detection and recognition) and generative or predictive tasks (for example, via replay) (Keller & Mrsic-Flogel, 2018; Josselyn & Tonegawa, 2020; Ven et al., 2020) indicates that this goal is achievable. Beyond being possible, these properties are also highly practical: successfully completing generative tasks like unsupervised conditional image generation (Hwang et al., 2021), for example, inherently requires that learned features for different classes be both structured for sampling and discriminative for conditioning.
On the other hand, the generative property can serve as a natural regularization against representation collapse.

Closed-Loop Transcription via a Constrained Maximin Game. The class of linear discriminative representations (LDRs) has recently been proposed for learning diverse and discriminative features for multi-class (visual) data, via optimization of the rate reduction objective (Chan et al., 2022). In the supervised setting, these representations have been shown to be both discriminative and generative when learned in a closed-loop transcription framework, via a maximin game over the rate reduction utility between an encoder and a decoder (Dai et al., 2022). Beyond the standard joint learning setting, where all classes are sampled uniformly throughout training, the closed-loop framework has also been successfully adapted to the incremental setting (Tong et al., 2022), where the optimal multi-class LDR is learned one class at a time. In the incremental (supervised) setting, one solves a constrained maximin problem over the rate reduction utility that keeps the learned memory of old tasks intact (as constraints) while learning new tasks. This framework has been shown to effectively alleviate the catastrophic forgetting suffered by most supervised learning methods.

Contributions. In this work, we show that the closed-loop transcription framework proposed for learning LDRs in the supervised setting (Chan et al., 2022) can be adapted to a purely unsupervised setting. We simply view each sample and its augmentations as a "new class," while using the rate reduction objective to ensure that the learned features are both invariant to augmentation and self-consistent in generation; this leads to a constrained maximin game similar to the one explored for incremental learning (Tong et al., 2022). Our overall approach is illustrated in Figure 1.
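To make the rate reduction objective concrete, the coding rate of a feature matrix and the resulting rate reduction over groups (e.g., a sample together with its augmentations, playing the role of a "new class") can be sketched in a few lines of NumPy. This is a rough sketch of the objective from Chan et al. (2022), not the authors' code; the function names and the grouping convention are our own illustration:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = (1/2) logdet(I + d/(n * eps^2) * Z Z^T): the number of bits
    needed to encode the columns of the d x n feature matrix Z up to
    precision eps."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T))
    return 0.5 * logdet

def rate_reduction(Z, groups, eps=0.5):
    """Delta R = R(Z) - sum_j (n_j / n) R(Z_j): expand the features of all
    samples (first term) while compressing the features within each group
    (second term), e.g. each sample together with its augmentations."""
    _, n = Z.shape
    compress = sum(
        (Z[:, idx].shape[1] / n) * coding_rate(Z[:, idx], eps)
        for idx in groups
    )
    return coding_rate(Z, eps) - compress
```

By concavity of log-determinant, the rate reduction is nonnegative, and it vanishes when all features collapse to a single direction, which is precisely why maximizing it expands features of all samples while compressing each group's augmentations together.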
As we demonstrate experimentally in Section 4, our formulation enjoys the mutual benefits of both discriminative and generative properties. It largely bridges the gap between two formerly distinct sets of methods: by standard metrics and under comparable experimental conditions, it achieves classification performance close to that of discriminative methods and unsupervised conditional generation quality significantly higher than that of state-of-the-art techniques. Coupled with evidence from prior work, this suggests that closed-loop transcription, through the (constrained) maximin game between the encoder and decoder, has the potential to offer a unifying framework for discriminative and generative representation learning across supervised, incremental, and unsupervised settings.



Method | Discriminative | Generative | UCIG
(Parmar et al., 2021) | ✔ | ✔ | ✗
CTRL-Binary (Dai et al., 2022) | ✔ | ✔ | ✗
SLOGAN (Hwang et al., 2021) | ✗ | ✔ | ✔
U-CTRL (ours) | ✔ | ✔ | ✔

Comparison of the downstream task capabilities of different unsupervised learning methods. UCIG denotes unsupervised conditional image generation.

