CLOSED-LOOP TRANSCRIPTION VIA CONVOLUTIONAL SPARSE CODING

Abstract

Autoencoding has been a popular and effective framework for learning generative models for images, with much empirical success. Autoencoders often use generic deep networks as the encoder and decoder, which are difficult to interpret, and the learned representations lack clear structure. In this work, we replace the encoder and decoder with standard convolutional sparse coding and decoding layers, obtained by unrolling an optimization algorithm for solving a (convexified) sparse coding program. Furthermore, to avoid the computational difficulties of minimizing a distributional distance between real and generated images, we adopt the recent closed-loop transcription (CTRL) framework, which maximizes the rate reduction of the learned sparse representations. We show that this simple framework achieves surprisingly competitive performance on large datasets such as ImageNet-1K, compared to existing autoencoding and generative methods under fair conditions. Even with simpler networks and fewer computational resources, our method produces regenerated images of high visual quality with good sample-wise consistency. More surprisingly, the learned autoencoder generalizes to unseen datasets. Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, scalability to large datasets (indeed, our method is the first sparse coding generative method to scale up to ImageNet), and trainability with smaller batch sizes.

1. INTRODUCTION

In recent years, deep networks have been widely used to learn generative models for real images, via popular methods such as generative adversarial networks (GANs) (Goodfellow et al., 2020), variational autoencoders (VAEs) (Kingma & Welling, 2013), and score-based diffusion models (Hyvärinen, 2005; Sohl-Dickstein et al., 2015; Ho et al., 2020). Despite tremendous empirical success and progress, these methods typically use empirically designed, or generic, deep networks for the encoder and decoder (or the discriminator, in the case of GANs). As a result, how each layer generates or transforms imagery data is not interpretable, and the internal structure of the learned representations remains largely unrevealed. Further, the true layer-by-layer interactions between the encoder and decoder remain largely unknown. These problems often make the network design for such methods uncertain, the training of such generative models expensive, the resulting representations hidden, and the images difficult to generate conditionally. The recently proposed closed-loop transcription (CTRL) framework (Dai et al., 2022b) aims to learn autoencoding models with more structured representations by maximizing the information gain, say in terms of the coding rate reduction (Ma et al., 2007; Yu et al., 2020) of the learned features. Nevertheless, like the aforementioned generative methods, CTRL uses two separate generic encoding and decoding networks, which limits the true potential of the framework, as we will discuss later. On the other hand, in image processing and computer vision, it has long been believed and advocated that sparse convolution or deconvolution is a conceptually simple generative model for natural images (Monga et al., 2021). That is, natural images at different spatial scales can be explicitly modeled as being generated from a sparse superposition of a number of atoms/motifs, known as a (convolutional) dictionary.
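To make the generative model concrete, the following is a minimal NumPy/SciPy sketch (not the paper's implementation; all sizes, the number of atoms K, and the step size and sparsity penalty are hypothetical). It synthesizes an image as a sparse superposition of convolutional dictionary atoms, and then performs one soft-thresholding (ISTA-style) step on the codes, the kind of iteration that, when unrolled, yields the encoding layers described above.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# Hypothetical sizes: K atoms of size p x p, image of size n x n.
K, p, n = 4, 5, 32

# Convolutional dictionary D: K small filters ("atoms"/"motifs"), unit-norm.
D = rng.standard_normal((K, p, p))
D /= np.linalg.norm(D.reshape(K, -1), axis=1)[:, None, None]

# Sparse code maps Z: mostly zero, a few active coefficients per atom.
Z = np.zeros((K, n - p + 1, n - p + 1))
for k in range(K):
    idx = rng.choice(Z[k].size, size=3, replace=False)
    Z[k].flat[idx] = rng.standard_normal(3)

# Generative (decoding) model: x = sum_k d_k * z_k  (2-D convolution).
X = sum(convolve2d(Z[k], D[k]) for k in range(K))

# One ISTA step for encoding:
#   Z <- soft_threshold(Z - eta * D^T (D Z - X), eta * lam)
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

eta, lam = 0.1, 0.05  # hypothetical step size and sparsity penalty
residual = sum(convolve2d(Z[k], D[k]) for k in range(K)) - X
# D^T acting on the residual = correlation = convolution with flipped filters.
grad = np.stack(
    [convolve2d(residual, D[k][::-1, ::-1], mode="valid") for k in range(K)]
)
Z_new = soft_threshold(Z - eta * grad, eta * lam)
```

Stacking several such iterations, with the dictionary filters treated as learnable weights, gives the unrolled sparse coding layers that replace a generic encoder.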
There is a long history in image processing of using such models for applications such as image denoising, restoration, and super-resolution (Elad & Aharon, 2006; Elad, 2010; Yang et al., 2010). Some recent literature has also attempted to use sparse convolution as a building block for designing more interpretable deep networks (Sulam et al., 2018). One conceptual benefit of such a model is that the encoding and decoding can be interpreted as mutually invertible

