CLOSED-LOOP TRANSCRIPTION VIA CONVOLUTIONAL SPARSE CODING

Abstract

Autoencoding has been a popular and effective framework for learning generative models for images, with much empirical success. Autoencoders often use generic deep networks as the encoder and decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we replace the encoder and decoder with standard convolutional sparse coding and decoding layers, obtained by unrolling an optimization algorithm for solving a (convexified) sparse coding program. Furthermore, to avoid the computational difficulties of minimizing distributional distances between the real and generated images, we utilize the recent closed-loop transcription (CTRL) framework, which maximizes the rate reduction of the learned sparse representations. We show that such a simple framework demonstrates surprisingly competitive performance on large datasets such as ImageNet-1K, compared to existing autoencoding and generative methods under fair conditions. Even with simpler networks and fewer computational resources, our method demonstrates high visual quality in regenerated images, with good sample-wise consistency. More surprisingly, the learned autoencoder generalizes to unseen datasets. Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, scalability to large datasets (indeed, ours is the first sparse-coding-based generative method to scale up to ImageNet), and trainability with smaller batch sizes.

1. INTRODUCTION

In recent years, deep networks have been widely used to learn generative models for real images, via popular methods such as generative adversarial networks (GAN) (Goodfellow et al., 2020), variational autoencoders (VAE) (Kingma & Welling, 2013), and score-based diffusion models (Hyvärinen, 2005; Sohl-Dickstein et al., 2015; Ho et al., 2020). Despite tremendous empirical successes and progress, these methods typically use empirically designed, or generic, deep networks for the encoder and decoder (or discriminator, in the case of GAN). As a result, how each layer generates or transforms imagery data is not interpretable, and the internal structures of the learned representations remain largely opaque. Further, the true layer-by-layer interactions between the encoder and decoder remain largely unknown. These problems often make the network design for such methods uncertain, the training of such generative models expensive, the resulting representations hidden, and conditional image generation difficult.

The recently proposed closed-loop transcription (CTRL) framework (Dai et al., 2022b) aims to learn autoencoding models with more structured representations by maximizing the information gain, measured by the coding rate reduction (Ma et al., 2007; Yu et al., 2020) of the learned features. Nevertheless, like the aforementioned generative methods, CTRL uses two separate generic encoding and decoding networks, which limit the true potential of the framework, as we will discuss later.

On the other hand, in image processing and computer vision, it has long been believed and advocated that sparse convolution or deconvolution is a conceptually simple generative model for natural images (Monga et al., 2021). That is, natural images at different spatial scales can be explicitly modeled as being generated from a sparse superposition of a number of atoms/motifs, known as a (convolution) dictionary.
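The sparse-superposition generative model just described can be made concrete in a few lines. Below is a minimal, hypothetical 1-D NumPy sketch (real dictionaries are 2-D and learned from data; the filters and code maps here are arbitrary illustrative choices): a signal is synthesized as a sum of dictionary atoms convolved with sparse code maps.

```python
import numpy as np

def sparse_generate(codes, filters):
    """Generate a 1-D signal as a sparse superposition of dictionary atoms:
    y = sum_k conv(x_k, d_k), where each code map x_k is sparse.
    This is the linear 'deconvolution' half of a convolutional sparse model."""
    return sum(np.convolve(x, d, mode="full") for x, d in zip(codes, filters))

# Toy example: 3 random filters, each paired with a 3-spike sparse code map.
rng = np.random.default_rng(1)
filters = [rng.standard_normal(5) for _ in range(3)]
codes = []
for _ in range(3):
    x = np.zeros(32)
    x[rng.integers(0, 32, size=3)] = rng.standard_normal(3)
    codes.append(x)
y = sparse_generate(codes, filters)  # length 32 + 5 - 1 = 36
```

Because generation is linear in the codes, encoding amounts to inverting this map under a sparsity prior, which is exactly the convolutional sparse coding program discussed next.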
There has been a long history in image processing of using such models for applications such as image denoising, restoration, and super-resolution (Elad & Aharon, 2006; Elad, 2010; Yang et al., 2010). Some recent literature has also attempted to use sparse convolution as a building block for designing more interpretable deep networks (Sulam et al., 2018). One conceptual benefit of such a model is that the encoding and decoding can be interpreted as mutually invertible (sparse) convolution and deconvolution processes, respectively, as illustrated in Figure 1 (right). At each layer, instead of using two separate convolution networks with independent parameters, the encoding and decoding processes share the same learned convolution dictionary. This is in contrast to most of the aforementioned generative or autoencoding methods. Hence, in this paper, we investigate the following question: can we use convolutional sparse coding layers to build deep autoencoding models whose performance can compete with, or even surpass, that of tried-and-tested deep networks? Although earlier attempts to incorporate such layers within the GAN and VAE frameworks have not resulted in competitive performance, the invertible convolutional sparse coding layers are naturally compatible with the objectives of the recent closed-loop transcription (CTRL) framework (Dai et al., 2022b); see Figure 1 (left). CTRL utilizes a self-critiquing sequential maximin game (Pai et al., 2022) between the encoder and decoder to optimize the coding rate reduction of the learned internal (sparse) representations. Such a self-critiquing mechanism can effectively enforce the learned convolution dictionaries, now shared by the encoder and decoder at each layer, to strike a better tradeoff between coding compactness and generation quality.
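The coding rate reduction quantity that the maximin game optimizes can be sketched as follows. This is a minimal NumPy illustration of the rate reduction of Ma et al. (2007) and Yu et al. (2020); the distortion parameter eps_sq and the toy data below are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

def coding_rate(Z, eps_sq=0.5):
    """Rate R(Z): (log-det) coding length of the columns of the d x n feature
    matrix Z, up to distortion eps_sq (Ma et al., 2007)."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps_sq)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps_sq=0.5):
    """Delta R = R(Z) - sum_j (n_j / n) R(Z_j): the information gain that the
    encoder maximizes over the learned (sparse) representations."""
    n = Z.shape[1]
    r_whole = coding_rate(Z, eps_sq)
    r_parts = sum(
        (np.sum(labels == j) / n) * coding_rate(Z[:, labels == j], eps_sq)
        for j in np.unique(labels)
    )
    return r_whole - r_parts
```

As a sanity check, features drawn from two mutually orthogonal subspaces (the configuration that rate reduction provably favors) yield a strictly positive Delta R.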
The closed-loop framework also avoids the computational difficulties associated with frameworks such as GAN and VAE in evaluating (or approximating) distribution distances in the pixel space. As we will show in this paper, by simply using invertible convolutional sparse coding layers (implemented as suggested by Zeiler et al. (2010)) within the CTRL framework, the performance of CTRL can be significantly improved compared to using two separate networks for the encoder and decoder. For instance, as observed in Dai et al. (2022b), the autoencoding learned by CTRL successfully aligns the distributions of the real and generated images in the feature space, but fails to achieve good sample-wise autoencoding. In this work, we show that CTRL can now achieve precise sample-wise alignment with the convolutional sparse coding layers. In addition, we show that deep networks constructed purely with convolutional sparse coding layers yield superior practical performance for image generation, with fewer model parameters and less computational cost. Our work provides compelling empirical evidence which suggests that multi-stage sparse (de)convolution has the potential to serve as a backbone operator for image generation. To be more specific, we will see through extensive experiments that the proposed simple closed-loop transcription framework with transparent and interpretable convolutional sparse coding layers enjoys the following benefits: 1. Good performance on large datasets. Compared to previous sparse coding based generative or autoencoding methods, our method scales well to large datasets such as ImageNet-1k,


Figure 1: Left: A CTRL architecture with convolutional sparse coding layers, in which the encoder and decoder share the same convolution dictionaries. Right: the encoder of each convolutional sparse coding layer is simply an unrolled optimization algorithm for convolutional sparse coding (e.g., ISTA/FISTA).
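The unrolled encoder in the right panel of Figure 1 can be sketched in a few lines. Below is a hypothetical 1-D, single-filter NumPy illustration (the actual layers operate on images with multi-filter dictionaries, and the step size, sparsity weight, and iteration count here are arbitrary illustrative choices): each ISTA iteration becomes one "layer" of the encoder, and the decoder reuses the very same filter, making the pair (approximately) mutually invertible.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (the ISTA nonlinearity)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_csc_encode(y, d, lam=0.1, n_iters=50):
    """Unrolled ISTA encoder for 1-D convolutional sparse coding:
        minimize over x:  0.5 * ||y - conv(x, d)||^2 + lam * ||x||_1.
    Each loop iteration corresponds to one 'layer' of the unrolled network."""
    K = len(d)
    N = len(y) - K + 1            # code length so conv(x, d, 'full') matches y
    x = np.zeros(N)
    L = np.sum(np.abs(d)) ** 2    # Lipschitz bound via Young's inequality
    step = 1.0 / L
    for _ in range(n_iters):
        residual = y - np.convolve(x, d, mode="full")
        grad = -np.correlate(residual, d, mode="valid")  # adjoint of conv
        x = soft_threshold(x - step * grad, step * lam)
    return x

def csc_decode(x, d):
    """Decoder: linear deconvolution with the *same* dictionary filter."""
    return np.convolve(x, d, mode="full")
```

With the step size bounded by the inverse Lipschitz constant, each iteration monotonically decreases the sparse coding objective, so deeper unrolling gives a more accurate (and sparser) encoding.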

Despite their simplicity and clarity, most sparse convolution based deep models are limited to tasks like image denoising (Mohan et al., 2019) or image restoration (Lecouat et al., 2020). Their empirical performance on image generation or autoencoding tasks has not yet been shown to be competitive with the aforementioned methods (Aberdam et al., 2020), in terms of either image quality or scalability to large datasets.

