STRUCTURE BY ARCHITECTURE: STRUCTURED REPRESENTATIONS WITHOUT REGULARIZATION

Abstract

We study the problem of self-supervised structured representation learning using autoencoders for downstream tasks such as generative modeling. Unlike most methods, which rely on matching an arbitrary, relatively unstructured prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance typically observed in VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. Our structural decoders learn a hierarchy of latent variables, thereby ordering the information without any additional regularization or supervision. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks, including generation, disentanglement, and extrapolation, on several challenging natural image datasets.

1. INTRODUCTION

Deep learning has achieved strong results on a plethora of challenging tasks. However, performing well on carefully tuned or synthetic datasets is usually insufficient to transfer to real-world problems (Tan et al., 2018; Zhuang et al., 2019), and directly collecting real labeled data is often prohibitively expensive. This has led to a particular interest in learning representations without supervision and designing inductive biases that induce useful structure in the representation to help with downstream tasks (Bengio et al., 2013; Bengio et al., 2017; Tschannen et al., 2018; Radhakrishnan et al., 2018). Autoencoders (Vincent et al., 2008) have become the de-facto standard for representation learning, largely due to the simple self-supervised training objective paired with the enormous flexibility in the model structure. In particular, variational autoencoders (VAE) (Kingma & Welling, 2013a) have become a popular framework which readily lends itself to structuring representations by augmenting the objective or using complex model architectures to achieve desirable properties such as matching a simple prior distribution for generative modeling (Kingma & Welling, 2013a; Vahdat & Kautz, 2020; Preechakul et al., 2022), compression (Yang et al., 2022), or disentangling the underlying factors of variation (Locatello et al., 2018; Khrulkov et al., 2021; Shu et al., 2019; Chen et al., 2020; Nie et al., 2020; Mathieu et al., 2019; Kwon & Ye, 2021; Shen et al., 2020). However, VAEs exhibit several well-studied practical limitations such as posterior collapse (Lucas et al., 2019b; Hoffman & Johnson, 2016; Vahdat & Kautz, 2020), holes in the representation (Li et al., 2021; Stühmer et al., 2020), and blurry generated samples (Kumar & Poole, 2020; Higgins et al., 2017; Burgess et al., 2018; Vahdat & Kautz, 2020).
Consequently, we seek a representation learning method which serves as a drop-in replacement for VAEs while improving performance on common tasks such as generation and disentanglement. One promising direction is causal modeling (Pearl, 2009; Peters et al., 2017; Louizos et al., 2017; Mitrovic et al., 2020; Shen et al., 2020), which formalizes potentially non-trivial structure of the generative process using structural causal models (see appendix A.1 for further discussion). This powerful and exciting framework serves as inspiration for our main contributions:

• We propose an architecture called the Structural Autoencoder (SAE), where the structural decoder infuses latent information one variable at a time to induce an intuitive ordering of information in the representation without supervision.

• We provide a sampling method, called hybrid sampling, which relies only on independence between latent variables rather than imposing a prior latent distribution, thereby enabling more expressive representations.

• We investigate the generalization capabilities of the encoder and decoder separately to better motivate the SAE architecture and to assess how the learned representation of an autoencoder can be adapted to novel factors of variation.

(…et al., 2020; Zhou et al., 2020). Although this structure is convenient for generative modeling and even tends to disentangle the latent space to some extent, it comes at the cost of somewhat blurry images due to posterior collapse and holes in the latent space. To mitigate the double-edged nature of VAEs (Makhzani & Frey, 2017; Lin et al., 2022), less aggressive regularization techniques have been proposed, such as the Wasserstein Autoencoder (WAE), which focuses on the aggregate posterior (Tolstikhin et al., 2018). Unfortunately, WAEs generally fail to produce a particularly meaningful or disentangled latent space (Rubenstein et al., 2018) unless some supervision is available (Han et al., 2021).
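The hybrid sampling contribution above can be illustrated with a minimal NumPy sketch. Under the assumption that the learned latent variables are (approximately) independent, one way to draw new codes without fitting a prior is to resample each latent dimension from its empirical marginal, e.g. by independently shuffling each dimension across a batch of encoded samples. The function name and this particular resampling scheme are illustrative, not a definitive account of the paper's method:

```python
import numpy as np

def hybrid_sample(latents, rng=None):
    """Form new latent codes by independently permuting each latent
    dimension across the batch.

    latents: array of shape (N, d) holding N encoded samples.
    Each output row mixes coordinates from different input rows, so the
    result is a valid sample only insofar as the latent variables are
    independent -- no prior distribution is imposed.
    """
    rng = np.random.default_rng(rng)
    out = latents.copy()
    for j in range(out.shape[1]):
        out[:, j] = rng.permutation(out[:, j])  # resample marginal j
    return out
```

Decoding such shuffled codes yields novel outputs whose per-dimension statistics match the aggregate posterior exactly, sidestepping the prior-matching regularization that causes the VAE trade-off discussed above.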
Meanwhile, a complementary approach to carefully adjusting the training objective is designing a model architecture beyond the conventional feed-forward style to induce a hierarchy in the representation. For example, the variational ladder autoencoder (VLAE) (Zhao et al., 2017) splits the latent space into separate chunks, each of which is processed at a different level of the encoder and decoder (called "rungs"). However, due to the regularization, VLAEs suffer from the same trade-offs as conventional VAEs. Further architectural improvements such as FiLM (Perez et al., 2018) or AdaIN layers (Karras et al., 2019) readily learn more complex relationships, resulting in more expressive models.

2. METHODS

Given high-dimensional observations X = (X_1, ..., X_D), the encoder learns to encode the observations into a low-dimensional representation U = (U_1, ..., U_d) (d ≪ D), and the decoder models the generative process that produced X from the latent variables, producing the reconstruction X̂. To structure the representation, without loss of generality, we may imagine the true generative process to follow some graphical structure between the (unknown) true factors of variation. If the graphical structure of the true generative process were known, the topology of the decoder could be shaped accordingly (Kipf & Welling, 2016; Zhang et al., 2019; Lin et al., 2022; Yang et al., 2021). However, in the fully unsupervised setting, we seek a sufficiently general structure on top of the uninterpretable hidden layers to make the representation more structured and interpretable.
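The setup above can be made concrete with a minimal sketch of the encoder/decoder shapes; the linear maps and the specific dimensions D and d are purely illustrative:

```python
import numpy as np

D, d = 784, 12  # observation dim and latent dim, d << D (illustrative values)

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((d, D)) * 0.01  # stand-in for a trained encoder
W_dec = rng.standard_normal((D, d)) * 0.01  # stand-in for a trained decoder

def encode(x):
    """Map an observation X in R^D to a representation U in R^d."""
    return W_enc @ x

def decode(u):
    """Map a representation U back to a reconstruction X_hat in R^D."""
    return W_dec @ u

x = rng.standard_normal(D)   # one observation
u = encode(x)                # low-dimensional code
x_hat = decode(u)            # reconstruction
```

Any structure imposed on U (such as the ordering introduced next) must therefore be induced by the architecture mapping U to X̂, since no labels are available.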

2.1. STRUCTURAL DECODERS

Structural decoders integrate information into the generative process one latent variable at a time, resulting in the following structure:

S_i := f_i(S_{i-1}, U_i),  i = 1, ..., d,

where S_i is a feature map (S_0 is a static, sample-independent set of trainable parameters) that depends on its latent variable U_i, its parent S_{i-1}, and, indirectly through S_{i-1}, on all ancestors S_j with j < i - 1.
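The recursion above can be sketched in a few lines of NumPy. The per-stage transform here (a shared-shape affine map followed by tanh) and all names and shapes are illustrative stand-ins for the paper's learned layers, chosen only to show how each U_i enters at exactly one stage:

```python
import numpy as np

class StructuralDecoder:
    """Sketch of the recursion S_i = f_i(S_{i-1}, U_i) for i = 1..d."""

    def __init__(self, d, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # S_0: static, sample-independent trainable parameters.
        self.s0 = rng.standard_normal(feat_dim)
        # One illustrative transform f_i per latent variable.
        self.W = rng.standard_normal((d, feat_dim, feat_dim)) * 0.1
        self.b = rng.standard_normal((d, feat_dim)) * 0.1

    def __call__(self, u):
        """u: latent code (U_1, ..., U_d); returns the final feature map S_d."""
        s = self.s0
        for i, u_i in enumerate(u):
            # U_i enters only at stage i; earlier latents reach this
            # stage indirectly through the running feature map s.
            s = np.tanh(self.W[i] @ s + u_i * self.b[i])
        return s
```

Because U_1 passes through every subsequent stage while U_d affects only the last, the architecture itself orders the latents from coarse to fine, without any regularization term.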

