UNSUPERVISED LEARNING OF STRUCTURED REPRESENTATIONS VIA CLOSED-LOOP TRANSCRIPTION

Abstract

This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes. While most existing unsupervised learning approaches focus on a representation for only one of these two goals, we show that a unified representation can enjoy the mutual benefits of having both. Such a representation is attainable by generalizing the recently proposed closed-loop transcription framework, known as CTRL, to the unsupervised setting. This entails solving a constrained maximin game over a rate reduction objective that expands features of all samples while compressing features of augmentations of each sample. Through this process, we see discriminative low-dimensional structures emerge in the resulting representations. Under comparable experimental conditions and network complexities, we demonstrate that these structured representations enable classification performance close to state-of-the-art unsupervised discriminative representations, and conditionally generated image quality significantly higher than that of state-of-the-art unsupervised generative models.

1. INTRODUCTION

In the past decade, we have witnessed explosive development in the practice of machine learning, particularly with deep learning methods. A key driver of success in practical applications has been marvelous engineering effort, often focused on fitting increasingly large deep networks to input data paired with task-specific sets of labels. Brute-force approaches of this nature, however, exert tremendous demands on hand-labeled data for supervision and on computational resources for training and inference. As a result, an increasing amount of attention has been directed toward self-supervised or unsupervised techniques for learning representations that not only require no human annotation effort, but can also be shared across downstream tasks.

Discriminative versus Generative. Tasks in unsupervised learning are typically separated into two categories. Discriminative tasks frame high-dimensional observations as inputs from which low-dimensional class or latent information can be extracted, while generative tasks frame observations as generated outputs, often to be sampled given some semantically meaningful conditioning. Unsupervised learning approaches targeted at discriminative tasks are mainly based on one key idea: pull different views of the same instance closer while enforcing a non-collapsed representation, via contrastive learning techniques (Chen et al., 2020b; He et al., 2020; Grill et al., 2020a), covariance regularization methods (Bardes et al., 2021; Zbontar et al., 2021), or architecture design (Chen & He, 2020; Grill et al., 2020b). Their success is typically measured by the accuracy of a simple classifier (say, a shallow network) trained on the representations they produce, which has progressively improved over the years. Representations learned by these approaches, however, do not capture much of the intrinsic structure of the data distribution, and have not demonstrated success for generative purposes.
In parallel, generative methods like GANs (Goodfellow et al., 2014) and VAEs (Kingma & Welling, 2013) have also been explored for unsupervised learning. Although generative methods have made striking progress in the quality of sampled or autoencoded data, the representations they learn demonstrate inferior classification performance compared to the aforementioned discriminative methods.

Toward A Unified Representation? The disparity between discriminative and generative approaches in unsupervised learning, contrasted against the fundamental goal of learning representations that are useful across many tasks, leads to the natural question that we investigate in this paper: in the unsupervised setting, is it possible to learn a unified representation that is effective for both discriminative and generative purposes? Further, do the two purposes mutually benefit each other? Concretely, we aim to learn a structured representation with the following two properties:
1. The learned representation should be discriminative, such that simple classifiers applied to learned features yield high classification accuracy.
2. The learned representation should be generative, with enough diversity to recover raw inputs, and structure that can be exploited for sampling and generating new images.
The fact that human visual memory serves both discriminative tasks (for example, detection and recognition) and generative or predictive tasks (for example, via replay) (Keller & Mrsic-Flogel, 2018; Josselyn & Tonegawa, 2020; Ven et al., 2020) indicates that this goal is achievable. Beyond being possible, these properties are also highly practical: successfully completing generative tasks like unsupervised conditional image generation (Hwang et al., 2021), for example, inherently requires that learned features for different classes be both structured for sampling and discriminative for conditioning.
On the other hand, the generative property can serve as a natural regularization that avoids representation collapse.

Closed-Loop Transcription via a Constrained Maximin Game. The class of linear discriminative representations (LDRs) has recently been proposed for learning diverse and discriminative features for multi-class (visual) data, via optimization of the rate reduction objective (Chan et al., 2022). In the supervised setting, these representations have been shown to be both discriminative and generative when learned in a closed-loop transcription framework via a maximin game over the rate reduction utility between an encoder and a decoder (Dai et al., 2022). Beyond the standard joint learning setting, where all classes are sampled uniformly throughout training, the closed-loop framework has also been successfully adapted to the incremental setting (Tong et al., 2022), where the optimal multi-class LDR is learned one class at a time. In the incremental (supervised) setting, one solves a constrained maximin problem over the rate reduction utility that keeps the learned memory of old tasks intact (as constraints) while learning new tasks. This framework has been shown to effectively alleviate the catastrophic forgetting suffered by most supervised learning methods.

Contributions.

In this work, we show that the closed-loop transcription framework proposed for learning LDRs in the supervised setting (Chan et al., 2022) can be adapted to a purely unsupervised setting. We only have to view each sample and its augmentations as a "new class" while using the rate reduction objective to ensure that learned features are both invariant to augmentation and self-consistent in generation; this leads to a constrained maximin game similar to the one explored for incremental learning (Tong et al., 2022). Our overall approach is illustrated in Figure 1. As we experimentally demonstrate in Section 4, our formulation enjoys the mutual benefits of both discriminative and generative properties. It largely bridges the gap between two formerly distinct sets of methods: by standard metrics and under comparable experimental conditions, it enables classification performance close to discriminative methods and unsupervised conditional generative quality significantly higher than state-of-the-art techniques. Coupled with evidence from prior work, this suggests that closed-loop transcription through the (constrained) maximin game between the encoder and decoder has the potential to offer a unifying framework for both discriminative and generative representation learning, across supervised, incremental, and unsupervised settings. Compared to the Binary-CTRL method of prior work (Dai et al., 2022), we impose two additional constraints: 1) sample-wise self-consistency between the features z_i and ẑ_i, i.e., z_i ≈ ẑ_i; and 2) invariance/similarity among features of augmented samples, i.e., z_i ≈ z_i^a = f(τ(x_i), θ), where x_i^a = τ(x_i) is an augmentation of sample x_i via some transformation τ(·).

2. RELATED WORK

Our work is most closely related to three categories of unsupervised learning methods: (1) self-supervised learning via discriminative models, (2) self-supervised learning via generative models, and (3) unsupervised conditional image generation. Table 1 compares the capabilities of models learned by various representative unsupervised learning methods.

Self-Supervised Learning for Discriminative Models. On the discriminative side, works like SimCLR (Chen et al., 2020b), MoCo (He et al., 2020), and BYOL (Grill et al., 2020a) have recently shown overwhelming effectiveness in learning discriminative representations of data. MoCo (He et al., 2020) and SimCLR (Chen et al., 2020b) learn features by pulling together features of augmented versions of the same sample while pushing apart features of all other samples, while BYOL (Grill et al., 2020a) trains a student network to predict the representation of a teacher network, without requiring negative pairs. BarlowTwins (Zbontar et al., 2021) and TCR (Li et al., 2022) learn by regularizing the covariance matrix of the embedding. However, features learned by this class of methods are typically highly compressed, and are not designed for generative purposes.

Self-Supervised Learning with Generative Models. On the generative side, the original GAN (Goodfellow et al., 2014) can be viewed as a natural self-supervised learning task. With an additional linear probe, works like DCGAN (Radford et al., 2015) have shown that features in the discriminator can be used for discriminative tasks. To further enhance the features, extensions like BiGAN (Donahue et al., 2016) and ALI (Dumoulin et al., 2016) introduce a third network into the GAN framework, aimed at learning an inverse mapping for the generator, which when coupled with labeled images can be used to study and supervise semantics in learned representations.
Other works like SSGAN (Chen et al., 2019), SSGAN-LA (Hou et al., 2021), and ContraD (Jeong & Shin, 2021) propose to incorporate augmentation tasks into GAN training to facilitate representation learning. Outside of GANs, variational autoencoders (VAEs) have been adapted to produce more semantically meaningful representations by trading off latent channel capacity and independence constraints against reconstruction accuracy (Higgins et al., 2016), an idea that has also been incorporated into recognition improvements via patch-level bottlenecks (Gupta et al., 2020), which encourage a VAE to focus on useful patterns in images. By incorporating data augmentation, VAEs have also been shown to achieve fair discriminative performance (Falcon et al., 2021). Recently, works like MAE (He et al., 2021) and CAE (Chen et al., 2022) have learned representations by solving masked reconstruction tasks with vision transformers. Autoregressive approaches like iGPT (Chen et al., 2020a) have also demonstrated decent self-supervised learning performance, which improves further with the incorporation of contrastive learning (Kim et al., 2021). However, unless supervised, features learned by the aforementioned methods either do not have strong discriminative performance, or cannot be directly exploited to condition the generative task.

Unsupervised Conditional Image Generation (UCIG). For generative models, we often want to generate images conditioned on a certain class or style, even in a completely unsupervised setting. This requires that the learned representations have structure corresponding to the desired conditioning. InfoGAN (Chen et al., 2016) proposes to learn interpretable representations by maximizing the mutual information between the observation and a subset of the latent code.
ClusterGAN (Mukherjee et al., 2019) assumes a mixed discrete-continuous prior in which discrete variables are encoded as one-hot vectors and continuous variables are sampled from a Gaussian distribution. Self-Conditioned GAN (Liu et al., 2020) trains using clusters of discriminative features as labels. SLOGAN (Hwang et al., 2021) proposes a new conditional contrastive loss (U2C) to learn the latent distribution of the data. Note that, compared to our work, ClusterGAN and SLOGAN introduce an additional encoder that increases computational complexity. On the VAE side, works like VaDE (Jiang et al., 2016) cluster based on the learned features of a supervised ResNet. Variational Clustering (Prasad et al., 2020) simultaneously learns a prior that captures the latent distribution of the images and a posterior that helps discriminate between data points in an end-to-end unsupervised setting. In this work, we will see how clusters can be estimated in a principled way within a more unified framework, by optimizing the same type of objective function that we use for learning features.

3. METHOD

3.1 PRELIMINARIES: RATE REDUCTION AND CLOSED-LOOP TRANSCRIPTION

Assumptions on Data. Our work, as well as prior work in closed-loop transcription (Dai et al., 2022; Tong et al., 2022), considers a set of N images X = [x_1, x_2, ..., x_N] ⊂ R^D sampled from k classes. Borrowing notation from (Yu et al., 2020), the membership of the N samples in the k classes is denoted by k diagonal matrices Π = {Π_j ∈ R^{N×N}}_{j=1}^{k}, where the diagonal entry Π_j(i, i) of Π_j is the probability of sample i belonging to class j. Let Ω := {Π | Σ_j Π_j = I, Π_j ≥ 0} be the set of all such matrices. Without loss of generality, we may assume that classes are separable, with the images of each class belonging to a low-dimensional submanifold of R^D.

Unsupervised Discriminative Autoencoding. The goal of transcription is to learn a unified representation, with the structure required to both classify and generate images from these k classes. Concretely, this is achieved by learning two continuous mappings: (1) an encoder f(·, θ): x ↦ z ∈ R^d, with d ≪ D, that maps all samples to their features Z = [z_1, z_2, ..., z_N] ⊂ R^d; and (2) an inverse map g(·, η): z ↦ x̂ ∈ R^D such that x and x̂ = g(f(x)) are close. In other words, X → Z → X̂, through f and then g, forms an autoencoding. In this work, we learn these mappings in an entirely unsupervised fashion, without knowing the ground-truth class labels Π at all. As stated in the introduction, a representation that is both discriminative and generative is difficult to achieve with standard generative methods like VAEs and GANs. This is one of the motivations for the closed-loop transcription framework (CTRL) proposed by (Dai et al., 2022), which we generalize to the unsupervised setting.

Maximizing Rate Reduction. The CTRL framework (Dai et al., 2022) was proposed for the supervised setting, where it aims to map each class onto an independent linear subspace.
As shown in (Yu et al., 2020), such a linear discriminative representation (LDR) can be achieved by maximizing a coding rate reduction objective, known as the MCR² principle:

    ΔR(Z | Π) := (1/2) log det(I + (d / (N ε²)) Z Z⊤) − Σ_{j=1}^{k} (tr(Π_j) / (2N)) log det(I + (d / (tr(Π_j) ε²)) Z Π_j Z⊤),   (1)

where the first term is denoted R(Z), the second term R_c, and each Π_j encodes the membership of the N samples described above. As discussed in (Chan et al., 2022), R(Z) measures the total rate (volume) of all features, whereas R_c measures the average rate (volume) of the k components. Our work adapts this formula to design meaningful objectives in the unsupervised setting.

Closed-Loop Transcription. To learn the autoencoding X → Z → X̂, a fundamental question is how to measure the difference between X and the regenerated X̂ = g(f(X)). It is typically very difficult to define a proper distance measure in the image space (Wang et al., 2004). To bypass this difficulty, the closed-loop transcription framework (Dai et al., 2022) proposes to measure the difference between X and X̂ through the difference between their features Z and Ẑ, obtained via the same encoder: X → Z → X̂ → Ẑ. The difference can be measured by the rate reduction between Z and Ẑ, a special case of (1) with k = 2 classes:

    ΔR(Z, Ẑ) := R(Z ∪ Ẑ) − (1/2)(R(Z) + R(Ẑ)).   (3)

Such a ΔR is a principled distance between subspace-like Gaussian ensembles, with the property that ΔR(Z, Ẑ) = 0 iff Cov(Z) = Cov(Ẑ) (Ma et al., 2007). As shown in (Dai et al., 2022), applying this measure in the closed-loop CTRL formulation can already learn a decent autoencoding, even without class information. This is known as the CTRL-Binary program:

    max_θ min_η ΔR(Z, Ẑ).   (4)

However, (4) is practically limited because it only aligns the dataset X and the regenerated X̂ at the distribution level. There is no guarantee that each sample x will be close to its decoded version x̂ = g(f(x)). For example, (Dai et al., 2022) shows that a car sample can be decoded into a horse; the autoencoding representations so obtained are not sample-wise self-consistent!
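The coding rate R(Z) and the binary rate reduction ΔR(Z, Ẑ) of (3) can be sketched in a few lines of NumPy. This is an illustrative simplification; the paper's actual implementation, choice of ε, and feature normalization may differ:

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T), with features as columns of Z (d x n)."""
    d, n = Z.shape
    # slogdet returns (sign, log|det|); the matrix here is positive definite, so sign = 1
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps ** 2) * Z @ Z.T)[1]

def binary_rate_reduction(Z, Z_hat, eps=0.5):
    """Delta R(Z, Z_hat) = R(Z u Z_hat) - (R(Z) + R(Z_hat)) / 2, as in Eq. (3)."""
    return coding_rate(np.hstack([Z, Z_hat]), eps) - 0.5 * (
        coding_rate(Z, eps) + coding_rate(Z_hat, eps))
```

Consistent with the property cited from (Ma et al., 2007), `binary_rate_reduction(Z, Z)` evaluates to zero, since the two ensembles then share the same covariance, and it is strictly positive whenever the covariances differ.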

3.2. SAMPLE-WISE CONSTRAINTS FOR UNSUPERVISED TRANSCRIPTION

To improve the discriminative and generative properties of representations learned in the unsupervised setting, we propose two additional mechanisms for the above CTRL-Binary maximin game (4). For simplicity and uniformity, these will be formulated as equality constraints over rate reduction measures, but in practice they can be enforced softly during optimization.

Sample-wise Self-Consistency via Closed-Loop Transcription. First, to address the issue that CTRL-Binary does not learn a sample-wise consistent autoencoding, we need to promote x̂ to be close to x for each sample. In the CTRL framework, this can be achieved by enforcing that the corresponding features z = f(x) and ẑ = f(x̂) are the same or close. That is, to promote sample-wise self-consistency, where x̂ = g(f(x)) is close to x, we want the distance between z and ẑ to be zero or small for all N samples. This can be formulated with rate reduction, which again avoids measuring differences in the image space:

    Σ_{i=1}^{N} ΔR(z_i, ẑ_i) = 0.   (5)

Self-Supervision via Compressing Augmented Samples. Since we do not know any class information relating samples in the unsupervised setting, the best we can do is to view every sample and its augmentations (say via translation, rotation, occlusion, etc.) as one "class", a basic idea behind almost all self-supervised learning methods. In the rate reduction framework, it is natural to compress the features of each sample and its augmentations. In this work, we adopt the standard transformations of SimCLR (Chen et al., 2020b) and denote such a transformation by τ. We denote an augmented sample by x^a = τ(x), and its corresponding feature by z^a = f(x^a, θ). For discriminative purposes, we hope the classifier is invariant to such transformations. Hence it is natural to enforce that the features z^a of all augmentations coincide with the feature z of the original sample x.
This is equivalent to requiring the distance between z and z^a, again measured in terms of rate reduction, to be zero (or small) for all N samples:

    Σ_{i=1}^{N} ΔR(z_i, z_i^a) = 0.   (6)

3.3 UNSUPERVISED REPRESENTATION LEARNING VIA CLOSED-LOOP TRANSCRIPTION

So far, we know the CTRL-Binary objective ΔR(Z, Ẑ) in (4) helps align the distributions, while sample-wise self-consistency (5) and sample-wise augmentation (6) help align and compress the features associated with each sample. Besides consistency, we also want learned representations to be maximally discriminative for different samples (here viewed as different "classes"). Notice that the rate distortion term R(Z) measures the coding rate (hence volume) of all features. It has been observed in (Li et al., 2022) that by maximizing this term, learned features expand and hence become more discriminative.

Unsupervised CTRL. Putting these elements together, we propose to learn a representation via the following constrained maximin program, which we refer to as unsupervised CTRL (U-CTRL):

    max_θ min_η R(Z) + ΔR(Z, Ẑ)  subject to  Σ_{i=1}^{N} ΔR(z_i, ẑ_i) = 0  and  Σ_{i=1}^{N} ΔR(z_i, z_i^a) = 0.   (7)

In practice, the above program can be optimized by alternating maximization and minimization between the encoder f(·, θ) and the decoder g(·, η). We adopt the following optimization strategy, which works well in practice and is used for all subsequent experiments on real image datasets:

    max_θ R(Z) + ΔR(Z, Ẑ) − λ_1 Σ_{i=1}^{N} ΔR(z_i, z_i^a) − λ_2 Σ_{i=1}^{N} ΔR(z_i, ẑ_i);   (8)
    min_η R(Z) + ΔR(Z, Ẑ) + λ_1 Σ_{i=1}^{N} ΔR(z_i, z_i^a) + λ_2 Σ_{i=1}^{N} ΔR(z_i, ẑ_i),   (9)

where the constraints of (7) have been converted (and relaxed) to Lagrangian terms with corresponding coefficients λ_1 and λ_2.

Unsupervised Conditional Image Generation via Rate Reduction. The above representation is learned without class information. In order to facilitate discriminative or generative tasks, it must be highly structured.
As we will see in the experiments, specific structure indeed emerges naturally in the representations learned by U-CTRL: globally, features of images in the same class tend to cluster together and separate from other classes (Figure 2); locally, features around individual samples exhibit approximately piecewise-linear low-dimensional structures (Figure 5). This highly structured feature distribution also suggests that the learned representation can be very useful for generative purposes. For example, we can organize the sample features into meaningful clusters, and model them with low-dimensional (Gaussian) distributions or subspaces. By sampling from these compact models, we can conditionally regenerate meaningful samples from the computed clusters. This is known as unsupervised conditional image generation (Hwang et al., 2021). To cluster features, we exploit the fact that the rate reduction framework (1) is inspired by unsupervised clustering via compression (Ma et al., 2007), which provides a principled way to find the membership Π. Concretely, we maximize the same rate reduction objective (1) over Π, but with the learned representation Z fixed. We simply view the membership Π as a nonlinear function of the features Z, say h_π(·, ξ): Z → Π with parameters ξ. In practice, we model this function with a simple neural network, such as an MLP head placed right after the output feature z. To estimate a "pseudo" membership Π̂ of the samples, we solve the following optimization problem:

    Π̂ = arg max_ξ ΔR(Z | Π(ξ)).   (10)

Experiments in Section 4.2 demonstrate that conditional image generation from clusters produced in this manner results in high-quality images that are highly similar in style.
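The multi-class rate reduction (1), which both shapes the representation and is maximized over memberships in (10), can be sketched in NumPy as follows. As an illustrative simplification, the membership head h_π is omitted, and Π is passed directly as a soft k×N matrix whose columns sum to one (equivalent to stacking the diagonals of the Π_j matrices in the text):

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T)."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps ** 2) * Z @ Z.T)[1]

def rate_reduction(Z, Pi, eps=0.5):
    """Delta R(Z | Pi), Eq. (1): expand all features, compress each soft cluster.

    Z:  d x n feature matrix (features as columns).
    Pi: k x n soft membership; Pi[j, i] = probability that sample i is in cluster j.
    """
    d, n = Z.shape
    expand = coding_rate(Z, eps)
    compress = 0.0
    for j in range(Pi.shape[0]):
        tr = Pi[j].sum()                 # tr(Pi_j)
        Zj = Z * np.sqrt(Pi[j])          # so that Zj @ Zj.T equals Z diag(Pi_j) Z^T
        compress += tr / (2 * n) * np.linalg.slogdet(
            np.eye(d) + d / (tr * eps ** 2) * Zj @ Zj.T)[1]
    return expand - compress
```

A completely uninformative membership (every entry 1/k) makes the compression term equal R(Z), so ΔR = 0; a membership that groups features lying on distinct low-dimensional subspaces yields ΔR > 0. Maximizing (10) over Π searches for exactly such groupings.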

4. EXPERIMENTS

We now evaluate the performance of the proposed U-CTRL framework and compare it with representative unsupervised generative and discriminative methods. The first set of experiments (Section 4.1) shows that despite being generative in nature, U-CTRL learns discriminative representations competitive with state-of-the-art discriminative methods. The second set (Section 4.2) shows that the learned generative representation can significantly boost the performance of unsupervised conditional image generation. Finally, the third set (Section 4.3) studies the advantages that generative representations have over discriminative ones. We conduct experiments on the following datasets: CIFAR-10 (Krizhevsky et al., 2014), CIFAR-100 (Krizhevsky et al., 2009), and Tiny ImageNet (Deng et al., 2009). Standard augmentations for self-supervised learning are used across all datasets (Chen et al., 2020b). We design all experiments to ensure that comparisons against U-CTRL are fair. For all methods that we compare against, we ensure that experiments are conducted with similar model sizes. When code for a similarly sized architecture cannot be found, we uniformly use ResNet-18 to reproduce baseline results, which is larger than the network used by our method. Details about network architectures and the experimental setting are given in Appendix A. All methods are trained for 400 epochs or the equivalent number of iterations (generative models often count training in iterations).

Table 2: Linear-probe classification accuracy.

Method                              CIFAR-10   CIFAR-100   Tiny-ImageNet
GAN-based methods
  SSGAN-LA (Hou et al., 2021)        0.803      0.543       0.344
  DAGAN+ (Antoniou et al., 2017)     0.772      0.519       0.224
  ContraD (Jeong & Shin, 2021)       0.852      0.514       -
VAE-based methods
  PATCH-VAE (Parmar et al., 2021)    0.471      0.325       -
  β-VAE (Higgins et al., 2016)       0.531      0.315       -
CTRL-based methods
  CTRL-Binary (Dai et al., 2022)     0.599      -           -
  U-CTRL (ours)                      0.874      0.552       0.360

4.1. DISCRIMINATIVE QUALITY OF LEARNED REPRESENTATIONS

To evaluate the discriminative quality of the learned representations, we follow the standard practice of measuring the accuracy of a simple linear classifier trained on the learned representation. Table 2 compares our method against SOTA generative self-supervised learning methods, and Table 3 compares it against SOTA discriminative self-supervised methods. Experimental and training details are given in Appendix A.

Quantitative Comparisons of Classification Performance. From Table 2, we observe that on all chosen datasets, our method achieves substantial improvements over existing generative self-supervised learning methods. This includes more complex datasets like CIFAR-100 and Tiny-ImageNet, where we surpass the current SOTA models. From Table 3, our method achieves performance similar to SOTA discriminative self-supervised models. These results echo our goal of seeking a more unified generative and discriminative representation: despite resembling a generative method architecturally, our method still produces highly discriminative representations. In addition, these results lead us to a fundamental question: when is incorporating both discriminative and generative properties better than handling the two separately? We provide preliminary answers in Section 4.3.

Qualitative Visualization of Learned Representations. To explain the classification performance of our method, we visualize the incoherence between features learned on the training datasets. Figure 2 shows cosine similarity heatmaps between the learned features, organized by ground-truth class labels. A block-diagonal pattern emerges automatically from U-CTRL training on all three datasets, similar to those observed in features learned in the supervised setting (Dai et al., 2022). Here, however, the blocks emerge and correspond to class labels despite the absence of any supervision at all.
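The linear-probe protocol described above can be illustrated with a minimal stand-in: a ridge-regularized least-squares linear classifier fit on frozen features. This is a simplification; the paper's probe is a linear layer trained with Adam (see Appendix A), and the `ridge` parameter here is our own illustrative addition:

```python
import numpy as np

def linear_probe(Z_train, y_train, Z_test, y_test, ridge=1e-3):
    """Fit a one-hot least-squares linear classifier on frozen features; return test accuracy.

    Z_*: d x n feature matrices (features as columns); y_*: integer labels in {0..k-1}.
    """
    d, n = Z_train.shape
    k = int(y_train.max()) + 1
    Y = np.eye(k)[y_train].T                                  # k x n one-hot targets
    # Solve (Z Z^T + ridge I) W = Z Y^T for the d x k weight matrix W
    W = np.linalg.solve(Z_train @ Z_train.T + ridge * np.eye(d), Z_train @ Y.T)
    pred = np.argmax(W.T @ Z_test, axis=0)
    return float(np.mean(pred == y_test))
```

On well-separated features (such as those in Figure 2's block-diagonal regime), even this crude probe reaches near-perfect accuracy, which is the point of the evaluation: the probe is deliberately simple so the measured accuracy reflects the representation, not the classifier.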

4.2. IMPROVED UNSUPERVISED CONDITIONAL GENERATION QUALITY

To evaluate the quality of unsupervised conditional image generation, we measure performance along two axes: cluster quality and image quality. We estimate clusters by optimizing (10), and show results and comparisons with both recent and classical methods in Table 4. Training details for the additional MLP head can be found in Appendix A.

Cluster Quality. We measure cluster quality via normalized mutual information (NMI) and clustering accuracy, on CIFAR-10 clustered into 10 classes and on CIFAR-100(20), which is clustered into 20 super-classes. From Table 4, we observe that on CIFAR-10, U-CTRL achieves an NMI almost double that of the existing SOTA among both GAN-based and VAE-based methods, with significantly improved clustering accuracy. Unlike many baselines, our method also scales to the more challenging CIFAR-100(20) dataset, where it again significantly outperforms alternatives. The improved clustering quality suggests potential for improving unsupervised conditional image generation, which relies on first finding statistically (and hence visually) meaningful clusters.

Image Quality. We use the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016) to measure image quality. From Table 4, it is evident that U-CTRL maintains competitive image quality compared to other methods, as measured by both FID and IS. We also compare original images against reconstructed ones in Figure 3, where we see that the original X is very similar to the reconstructed X̂; U-CTRL indeed achieves very good sample-wise self-consistency.

Unsupervised Conditional Image Generation. In Figure 4, we visualize images generated from the ten unsupervised clusters obtained from (10). Each block represents one cluster and each row corresponds to one principal component of that cluster.
Despite training without any labels, the model not only organizes samples into correct clusters, but also preserves the statistical diversity within each cluster/class. We can easily recover the diversity within each cluster by computing different principal components and then sampling and generating accordingly. More detailed illustrations with more samples are provided in Appendix B.
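The "sample along principal components of each cluster" step can be sketched as follows. This is a simplified stand-in for the feature sampling technique of (Dai et al., 2022); the component count and scaling are illustrative choices, and in the full pipeline the sampled features would be passed through the decoder g(·, η) to produce images:

```python
import numpy as np

def sample_cluster_features(Zc, n_samples=8, n_components=2, rng=None):
    """Sample new features from a low-dimensional Gaussian fit to one cluster.

    Zc: d x n features of one cluster. New features are drawn around the cluster
    mean along its top principal components, with per-component scale set by the
    corresponding singular values.
    """
    rng = np.random.default_rng(rng)
    mu = Zc.mean(axis=1, keepdims=True)
    U, S, _ = np.linalg.svd(Zc - mu, full_matrices=False)
    scales = S[:n_components] / np.sqrt(Zc.shape[1])          # per-component std. dev.
    coeffs = rng.standard_normal((n_components, n_samples)) * scales[:, None]
    return mu + U[:, :n_components] @ coeffs                   # d x n_samples
```

Because the samples are confined to the cluster's mean plus its top principal directions, they stay on the low-dimensional structure that U-CTRL learns for that cluster, which is what makes the conditional generations in Figure 4 both on-class and diverse.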

4.3. BENEFITS OF U-CTRL'S STRUCTURED REPRESENTATION

As shown in the previous section, on datasets like CIFAR-10, CIFAR-100, and Tiny-ImageNet, our framework achieves representation quality close to that of the best discriminative self-supervised learning methods. A clear advantage of this is computational efficiency: only a single representation needs to be trained for a much broader set of tasks. This subsection aims to provide additional insight into how a unified model can be more beneficial for a broader range of tasks.

Domain Transfer. Regenerating images is demanding on the encoder, which must produce a more informative representation than contrastive training would. We hypothesize that an encoder trained with a generative task retains more information about the image, allowing the representation to generalize better. To verify this, we compare accuracy on CIFAR-100 using models learned on CIFAR-10 in Table 5. Compared to purely discriminative self-supervised learning models, U-CTRL is 4 percentage points better in classification accuracy.

Visualization of Emergent Structures. The representations learned by U-CTRL differ significantly from those learned by previous discriminative or generative methods. To illustrate this, we use t-SNE (Van der Maaten & Hinton, 2008) to visualize the learned representations in 2D. Figure 5 compares the t-SNE of representations learned on CIFAR-10 by U-CTRL and MoCo-V2.

5. CONCLUSION AND DISCUSSION

In this work, we proposed an unsupervised formulation of the closed-loop transcription framework (Dai et al., 2022). We experimentally demonstrated that it is possible to learn a unified representation for both discriminative and generative purposes, resulting in highly structured representations. Further, we showed that these two purposes mutually benefit each other in various tasks, e.g., conditional image generation and domain transfer. Compared to the more specialized representations learned in prior works, our results suggest that such a unified representation has the potential to support and benefit a wider range of new tasks. In future work, we believe the learned representations can be further improved by jointly optimizing the feature representation and feature clusters, as suggested in the original rate reduction paper (Chan et al., 2022). Features with high likelihood of belonging to the same cluster can be further linearized and compressed. Due to its unifying nature and the simplicity of the underlying concepts, this new framework may be extended beyond image data, such as to sequential or dynamical observations.

A EXPERIMENTAL DETAILS

[Architecture table fragment; classification head: z ∈ R^{1×1×n_z}, followed by Linear(n_z, number of classes).]

For CIFAR-10, CIFAR-100, and Tiny ImageNet, we train our framework with a batch size of 1024 over 20,000 iterations. All experiments are conducted with at most 4 RTX 3090 GPUs. Methods compared against in Table 3 are trained with a batch size of 256, because Chen et al. (2020b) observe that purely discriminative methods tend to perform better with smaller batch sizes. Methods in Table 2 use the optimal parameters provided in their GitHub code. For training the MLP head for unsupervised conditional image generation (10), we again use Adam (Kingma & Ba, 2014) as our optimizer with hyperparameters β_1 = 0.5, β_2 = 0.999. We choose a learning rate of 0.0001 and ε² = 0.2, with batch size 1024 over 5000 iterations.
For training the linear classifier, we use Adam (Kingma & Ba, 2014) as our optimizer with hyperparameters β1 = 0.5, β2 = 0.999. We set the learning rate to 0.0001, with a batch size of 1024 over 5,000 iterations.
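A minimal NumPy sketch of this linear-probe training, using the Adam hyperparameters stated above (β1 = 0.5, β2 = 0.999, learning rate 1e-4). The softmax cross-entropy loss is an assumption (the text does not name the loss), and the sketch trains full-batch, whereas the actual experiments use a deep-learning framework with mini-batches of size 1024.

```python
import numpy as np

def train_linear_probe(Z, y, n_class, lr=1e-4, betas=(0.5, 0.999),
                       steps=5000, adam_eps=1e-8):
    """Full-batch Adam training of a single linear layer on frozen features."""
    n, d = Z.shape
    W = np.zeros((d, n_class))
    m = np.zeros_like(W)
    v = np.zeros_like(W)
    onehot = np.eye(n_class)[y]
    for t in range(1, steps + 1):
        logits = Z @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)           # softmax probabilities
        g = Z.T @ (p - onehot) / n                  # cross-entropy gradient
        m = betas[0] * m + (1 - betas[0]) * g       # Adam moment updates
        v = betas[1] * v + (1 - betas[1]) * g * g
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        W -= lr * m_hat / (np.sqrt(v_hat) + adam_eps)
    return W

# Usage on linearly separable toy features.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(size=(100, 8)) + 3.0,
               rng.normal(size=(100, 8)) - 3.0])
y = np.array([0] * 100 + [1] * 100)
W = train_linear_probe(Z, y, n_class=2, steps=500)
acc = float(((Z @ W).argmax(axis=1) == y).mean())
```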

B.1 CLUSTER RECONSTRUCTION

In this subsection, we visualize the reconstructions of the ten clusters predicted and generated by U-CTRL on the CIFAR-10 training set. Each block in Figure 7 contains a random sample of reconstructed data from a cluster, together with the total number of samples within it. Note that CIFAR-10 contains 50,000 training samples, split evenly across 10 classes. As Figure 7 shows, the number of samples in each cluster is very close to 5,000, with the largest deviator (cluster 9) containing 3,942 samples. Without any cues, one can easily match each unsupervised cluster to a CIFAR-10 class. For a class like 'bird', we observe that the model groups images of standing birds, flying birds, and bird heads, despite their visual differences.

B.2 UNSUPERVISED CONDITIONAL IMAGE GENERATION

Building on U-CTRL's ability to cluster CIFAR-10 samples, we demonstrate the model's ability to perform unsupervised conditional image generation in Figure 8 . In contrast to reconstruction, where images are regenerated from features corresponding to real samples, we generate images based on the feature sampling technique proposed in (Dai et al., 2022) . From these results, we observe that the U-CTRL framework maintains in-cluster diversity, and that the diversity can be recovered and visualized via simple principal component analysis.
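The sampling step above can be sketched as follows, in the spirit of the feature sampling technique of Dai et al. (2022): perturb each cluster's mean feature along the cluster's principal directions, then decode each sampled feature (through the decoder g(·, η) of the framework) to obtain one conditionally generated image. The per-direction noise scale used here (singular value over √n, times a `scale` factor) is an illustrative assumption, not the paper's exact setting.

```python
import numpy as np

def sample_cluster_features(Zc, n_components=4, n_samples=8, scale=1.0, seed=0):
    """Sample new features for one cluster by perturbing the cluster mean
    along its principal directions. Returns (n_components, n_samples, d)."""
    rng = np.random.default_rng(seed)
    mu = Zc.mean(axis=0)
    X = Zc - mu
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    rows = []
    for k in range(n_components):                  # one row per principal direction
        sigma = S[k] / np.sqrt(len(Zc))            # std-dev along direction k
        noise = rng.normal(size=n_samples) * sigma * scale
        rows.append(mu + np.outer(noise, Vt[k]))
    return np.stack(rows)

# Usage: 50 cluster features of dimension 16.
rng = np.random.default_rng(1)
Zc = rng.normal(size=(50, 16))
samples = sample_cluster_features(Zc, n_components=2, n_samples=5)
```

Decoding each row of `samples` then yields one row of images per principal direction, as in Figure 8.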

C ABLATION STUDIES

C.1 THE IMPORTANCE OF EACH TERM IN THE U-CTRL FORMULATION

In this section, we study the significance of the sample-wise constraints and the extra rate distortion term in formulation (7). Table 10 presents the objectives that we study:

• Objective I is the constrained U-CTRL maximin (7).
• Objective II is the constrained maximin without the augmentation compression constraint (6).
• Objective III is the constrained maximin without the sample-wise self-consistency constraint (5).
• Objective IV is the constrained maximin without the extra rate distortion term.
• Objective V is U-CTRL without the augmentation compression constraint and the sample-wise self-consistency constraint.
• Objective VI is the CTRL-Binary maximin formulation (4).

Table 11 shows the results of a linear probe on representations trained with each objective on CIFAR-10. From the table, it is evident that both constraints and the rate distortion term are pivotal to the success of our framework. Table 12 shows that U-CTRL without the MCR² term not only learns a worse representation but also generalizes worse to out-of-distribution data. This is consistent with our discussion in the introduction: the discriminative and generative tasks together learn features that benefit each other.

D RANDOM SEED SENSITIVITY

In this section, we verify the stability of our method against different random seeds. We report in Table 13 the accuracy of U-CTRL on CIFAR-10 with different seeds. We observe that the choice of seed has very little impact on performance. 

E MORE COMPARISON ON T-SNE

Due to limited space in the main body, we present in this section a comparison of t-SNE embeddings between U-CTRL and other discriminative methods. As shown in Figure 10, U-CTRL enjoys a more structured representation compared to purely discriminative methods.



Notice that computing the rate reduction term ∆R for all samples, or even for a batch of samples, requires computing the expensive log det of large matrices. In practice, based on the geometric meaning of ∆R for two vectors, ∆R can be approximated by an ℓ² norm or the cosine distance between the two vectors.
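To make this approximation concrete, the sketch below computes the exact pairwise ∆R from the coding-rate definition of the rate reduction framework (Chan et al., 2022), R(Z) = ½ log det(I + d/(nε²) ZᵀZ) for n feature rows of dimension d, with ε² = 0.2 as in the training details. It illustrates the geometric fact the surrogate exploits: ∆R between two unit vectors vanishes when they coincide and grows with the angle between them, just like the cosine distance.

```python
import numpy as np

def coding_rate(Z, eps=0.2):
    """R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z), rows of Z are features."""
    n, d = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps ** 2) * Z.T @ Z)[1]

def delta_r_pair(z1, z2, eps=0.2):
    """DeltaR between two vectors: rate of the pair coded jointly minus
    the average rate of each vector coded alone."""
    pair = coding_rate(np.stack([z1, z2]), eps)
    return pair - 0.5 * (coding_rate(z1[None], eps) + coding_rate(z2[None], eps))

# Unit vectors: identical, nearly aligned, and orthogonal.
z = np.eye(8)[0]
w = np.eye(8)[1]                 # orthogonal to z
u = z + 0.1 * w
u = u / np.linalg.norm(u)        # nearly aligned with z
```

Here `delta_r_pair(z, z)` is (numerically) zero, while the nearly aligned pair has a smaller ∆R than the orthogonal pair, which is why a cheap cosine distance can stand in for the log det computation on vector pairs.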



Figure 1: Overall framework of closed-loop transcription for unsupervised learning. Two additional constraints are imposed on the Binary-CTRL method proposed in prior work (Dai et al., 2022): 1) self-consistency between sample-wise features z_i and ẑ_i, i.e., z_i ≈ ẑ_i; and 2) invariance/similarity between features of augmented samples z_i and z_i^a, i.e., z_i ≈ z_i^a = f(τ(x_i), θ), where x_i^a = τ(x_i) is an augmentation of sample x_i via some transformation τ(·).

Figure 2: Emergence of block-diagonal structures of |Z ⊤ Z| in the feature space for CIFAR-10 (left), 10 random classes from CIFAR-100 (middle), and 10 random classes from Tiny ImageNet (right).
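The heatmaps in Figure 2 can be reproduced with a few lines: sort the unit-normalized features by cluster label and take the absolute Gram matrix. With row-stacked features this is |Z Zᵀ| (equivalently |ZᵀZ| for the column-stacked convention in the figure); same-cluster features yield bright diagonal blocks. A hedged sketch on synthetic clusters, with plotting (e.g. matplotlib's `imshow`) omitted:

```python
import numpy as np

def abs_gram(Z, labels):
    """Sort unit-normalized feature rows by cluster label and return the
    absolute Gram matrix; bright diagonal blocks indicate cluster structure."""
    order = np.argsort(labels, kind="stable")
    Zs = Z[order]
    Zs = Zs / np.linalg.norm(Zs, axis=1, keepdims=True)
    return np.abs(Zs @ Zs.T)

# Two tight synthetic clusters around orthogonal directions.
rng = np.random.default_rng(0)
c0, c1 = np.eye(8)[0], np.eye(8)[1]
Z = np.vstack([c0 + 0.01 * rng.normal(size=(10, 8)),
               c1 + 0.01 * rng.normal(size=(10, 8))])
labels = np.array([0] * 10 + [1] * 10)
S = abs_gram(Z, labels)
```

In this toy example the within-cluster entries of `S` are near 1 while cross-cluster entries are near 0, i.e., the block-diagonal pattern of Figure 2.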

Figure 4: Unsupervised conditional image generation from each cluster of CIFAR-10, using U-CTRL. Different rows show generation along different principal components of each cluster.

Table: Quality of UCIG on CIFAR-10 and CIFAR-100(20).

| Method | CIFAR-10 NMI | Accuracy | FID↓ | IS↑ | CIFAR-100(20) NMI | Accuracy | FID↓ | IS↑ |
| GAN-based methods |
| Self-Conditioned GAN (Liu et al., 2020) | 0.333 | 0.117 | 18.0 | 7.7 | 0.214 | 0.092 | 24.1 | 5.2 |
| SLOGAN (Hwang et al., 2021) | 0.340 | – | 20.6 | – | – | – | – | – |
| VAE-based methods |
| GMVAE (Dilokthanakul et al., 2016) | – | 0.247 | – | – | – | – | – | – |
| Variational Clustering | – | 0.445 | – | – | – | – | – | – |
| CTRL-based methods |
| U-CTRL (ours) | 0.658 | 0.799 | 17.4 | 8.1 | 0.374 | 0.433 | 20.1 | 7.7 |

Figure 5: t-SNE visualizations of learned features of CIFAR-10 with different models.

Figure 7: More results on the reconstruction of clusters in CIFAR-10.

Objective I:   max_θ min_η  R(Z) + ∆R(Z, Ẑ)  s.t.  Σ_{i∈N} ∆R(z_i, ẑ_i) = 0  and  Σ_{i∈N} ∆R(z_i, z_i^a) = 0
Objective II:  max_θ min_η  R(Z) + ∆R(Z, Ẑ)  s.t.  Σ_{i∈N} ∆R(z_i, ẑ_i) = 0
Objective III: max_θ min_η  R(Z) + ∆R(Z, Ẑ)  s.t.  Σ_{i∈N} ∆R(z_i, z_i^a) = 0
Objective IV:  max_θ min_η  ∆R(Z, Ẑ)  s.t.  Σ_{i∈N} ∆R(z_i, ẑ_i) = 0  and  Σ_{i∈N} ∆R(z_i, z_i^a) = 0
Objective V:   max_θ min_η  R(Z) + ∆R(Z, Ẑ)
Objective VI:  max_θ min_η  ∆R(Z, Ẑ)
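In practice, the hard constraints in these maximin problems have to be enforced somehow; one common option is to relax them into penalty terms. The sketch below writes the encoder side of Objective I in such a penalty form, using the cosine-distance surrogate for the sample-wise ∆R terms mentioned in the appendix note. The penalty weight `lam` and the sign convention (a single loss to minimize, with the decoder's min step omitted) are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def rate(M, eps=0.2):
    """R(M) = 1/2 logdet(I + d/(n eps^2) M^T M), rows of M are features."""
    n, d = M.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps ** 2) * M.T @ M)[1]

def objective_i_penalty(Z, Z_hat, Z_aug, lam=1.0, eps=0.2):
    """Penalty-form sketch of Objective I (encoder side): maximize
    R(Z) + DeltaR(Z, Z_hat) while driving the sample-wise consistency and
    augmentation terms to zero via cosine-distance penalties."""
    def cos_dist(A, B):
        num = (A * B).sum(axis=1)
        den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
        return 1.0 - num / den
    # R(Z) + DeltaR(Z, Z_hat), with DeltaR as joint rate minus average rate.
    expand = rate(Z, eps) + (rate(np.vstack([Z, Z_hat]), eps)
                             - 0.5 * (rate(Z, eps) + rate(Z_hat, eps)))
    penalty = cos_dist(Z, Z_hat).mean() + cos_dist(Z, Z_aug).mean()
    return -expand + lam * penalty     # negated: a loss to minimize

rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 8))
Z_hat = rng.normal(size=(16, 8))
loss_aligned = objective_i_penalty(Z, Z_hat, Z)    # augmentations match features
loss_flipped = objective_i_penalty(Z, Z_hat, -Z)   # maximally misaligned
```

As expected, the loss is lower when the augmented features agree with the originals, since only the penalty term differs between the two calls.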

Figure 8: Unsupervised conditional image generation on CIFAR-10. Each block represents a cluster; within each block, each row represents one principal component direction of the cluster, and samples along each row correspond to different noises applied in that principal direction.

C.2 THE IMPORTANCE OF MCR² IN THE U-CTRL FORMULATION

In this section, we verify the significance of the MCR² term ∆R(Z, Ẑ) in our method. We conduct an ablation study on CIFAR-10 with the same network and training conditions. If we take away the MCR² term from our objective (denoted U-CTRL-noMCR²), both the learned representation and the reconstruction quality degrade, as shown in Table 12 and Figure 9.

Figure 9: Visualization of original images X and their reconstructions X̂ produced by U-CTRL-noMCR² on the CIFAR-10 dataset.

Comparison of the downstream task capabilities of different unsupervised learning methods. UCIG refers to Unsupervised Conditional Image Generation (Hwang et al., 2021).

Comparison of classification accuracy on CIFAR-10, CIFAR-100, and Tiny-ImageNet with other generative self-supervised learning methods. U-CTRL is clearly better.


Comparison of the quality of UCIG on CIFAR-10 and CIFAR-100(20). Many of the methods compared do not provide code that scales up to CIFAR-100(20), in which case we leave the corresponding table cell blank.

Network architecture of the linear classifier.

Network architecture of the MLP head for unsupervised conditional image generation.

Five different objective functions for U-CTRL.

Ablation study on varying random seeds.

ETHICS STATEMENT

All authors agree and will adhere to the conference's Code of Ethics. We do not anticipate any potential ethics issues regarding the research conducted in this work.

REPRODUCIBILITY STATEMENT

Settings and implementation details of network architectures, optimization methods, and common hyperparameters are described in Appendix A. We will also make our source code available upon request by the reviewers or area chairs.

A TRAINING DETAILS

A.1 NETWORK ARCHITECTURES

Tables 6 and 7 and Figure 6 give details on the network architectures of the decoder and encoder used in our experiments. A black rectangle marked with "conv, s=2" denotes a convolutional layer with stride 2; an orange rectangle marked with "dconv, s=2" denotes a deconvolutional layer with stride 2. The "×k" beside a red frame means that the layers inside the frame are treated as a block and stacked k times. All α values in the Leaky-ReLU (i.e., lReLU) activations of the encoder are set to 0.2. We set (n_z = 128, n_c = 3, k = 3) for CIFAR-10, (n_z = 256, n_c = 3, k = 4) for CIFAR-100, and (n_z = 256, n_c = 3, k = 4) for Tiny-ImageNet. As a comparison, ResNet-18 contains around 11 million parameters, whereas our encoder contains only between 4 and 6 million parameters, depending on the choice of k. For all experiments, we use Adam (Kingma & Ba, 2014) as our optimizer with hyperparameters β1 = 0.5, β2 = 0.999. The learning rate is set to 0.0001, and we choose ε² = 0.2. For all experiments, we adopt the augmentations from SimCLR (Chen et al., 2020b).
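As a sanity check on the stacked stride-2 blocks, the output size of each "conv, s=2" layer follows standard convolution arithmetic. Kernel size 3 and padding 1 below are assumptions for illustration (the exact layer specifications are in Tables 6 and 7); under them, k = 3 stride-2 stages shrink a 32×32 CIFAR-10 image to 4×4.

```python
def conv_out(size, kernel=3, stride=2, pad=1):
    """Output spatial size of a convolution:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Chain of k = 3 stride-2 stages on a 32x32 input.
sizes = [32]
for _ in range(3):
    sizes.append(conv_out(sizes[-1]))
# sizes == [32, 16, 8, 4]
```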

