STRUCTURE BY ARCHITECTURE: STRUCTURED REPRESENTATIONS WITHOUT REGULARIZATION

Abstract

We study the problem of self-supervised structured representation learning using autoencoders for downstream tasks such as generative modeling. Unlike most methods which rely on matching an arbitrary, relatively unstructured, prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance typically observed in VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. Our structural decoders learn a hierarchy of latent variables, thereby ordering the information without any additional regularization or supervision. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation using several challenging and natural image datasets.

1. INTRODUCTION

Deep learning has achieved strong results on a plethora of challenging tasks. However, performing well on carefully tuned or synthetic datasets is usually insufficient for transfer to real-world problems (Tan et al., 2018; Zhuang et al., 2019), and directly collecting real labeled data is often prohibitively expensive. This has led to particular interest in learning representations without supervision and in designing inductive biases that impose useful structure on the representation to help with downstream tasks (Bengio et al., 2013; Bengio et al., 2017; Tschannen et al., 2018; Radhakrishnan et al., 2018). Autoencoders (Vincent et al., 2008) have become the de-facto standard for representation learning, largely due to their simple self-supervised training objective paired with enormous flexibility in model structure. In particular, variational autoencoders (VAEs) (Kingma & Welling, 2013a) have become a popular framework that readily lends itself to structuring representations by augmenting the objective or employing complex model architectures to achieve desirable properties such as matching a simple prior distribution for generative modeling (Kingma & Welling, 2013a; Vahdat & Kautz, 2020; Preechakul et al., 2022), compression (Yang et al., 2022), or disentangling the underlying factors of variation (Locatello et al., 2018; Khrulkov et al., 2021; Shu et al., 2019; Chen et al., 2020; Nie et al., 2020; Mathieu et al., 2019; Kwon & Ye, 2021; Shen et al., 2020). However, VAEs exhibit several well-studied practical limitations, such as posterior collapse (Lucas et al., 2019b; Hoffman & Johnson, 2016; Vahdat & Kautz, 2020), holes in the representation (Li et al., 2021; Stühmer et al., 2020), and blurry generated samples (Kumar & Poole, 2020; Higgins et al., 2017; Burgess et al., 2018; Vahdat & Kautz, 2020).
Consequently, we seek a representation learning method that serves as a drop-in replacement for VAEs while improving performance on common tasks such as generation and disentanglement. One promising direction is causal modeling (Pearl, 2009; Peters et al., 2017; Louizos et al., 2017; Mitrovic et al., 2020; Shen et al., 2020), which formalizes potentially non-trivial structure in the generative process using structural causal models (see appendix A.1 for further discussion). This powerful and exciting framework serves as inspiration for our main contributions:

